The Center for the Information Sciences has developed and maintains
an experimental system for the literature of the information
sciences. At present the collection contains about 2,500 documents and
is used for instruction, reference, research and experimentation.

Documents are indexed manually and a coordinate index system is
used with a controlled thesaurus. Posting, up-dating, author listings,
and both associative and non-associative searches are performed on the
GE-225 computer. On-line access is facilitated through a Datanet-15
and a MOD-33 ASR Teletype in the Center.

In addition, a growing collection of natural language text on tape
is maintained for automatic indexing and abstracting studies.

This  series  of  studies  reports  experimentation  and  research  on 
this  operating  system. 
AN ASSOCIATIVITY TECHNIQUE FOR AUTOMATICALLY
OPTIMIZING RETRIEVAL RESULTS


by 


Ronald  R.  Anderson 


Abstract 


An experiment is described which evaluates the effectiveness of an
associative search technique for automatically optimizing retrieval
results. Originally, the design of the experiment called for testing an
existing automatic retrieval method by using the associativity formula
in the Center for the Information Sciences (CIS) document retrieval
system. However, during the early stages of the experiment, it became
apparent that an adjustment was necessary if the automatic technique was
to remain effective. Such a modification was made, resulting in an
associative technique which would enable the CIS system to both
automatically expand and automatically narrow the number of documents
retrieved by an initial search request. That is, the technique retrieves
documents related to a request even though they may not be indexed by
the exact terms of the request, and satisfies the user's depth-of-search
requirement by presenting the documents retrieved in the order of their
relevance to the request.
A.  Introduction


During  the  past  few  years,  statistical  measures  of  word  association 
have  become  a  common  search  tool  in  automatic  information  retrieval 
systems.  These  statistical  measures  are  derived  from  formulas  which 
attempt  to  correlate  two  given  index  terms  on  the  basis  of  their  use 
frequencies  and  their  frequencies  of  co-occurrence  in  the  documents  of 
a  given  collection. 

The basis for using statistical techniques is the assumption that
the assignment of two index terms to a given document may be interpreted
as a small piece of probabilistic evidence that the two terms are
correlated. That is, the document is assumed to describe a relationship
between the topics denoted by the two index terms. Presumably, by
accumulating such small pieces of evidence from a large collection of
documents, it is possible to arrive at a meaningful over-all measure of
association for any given pair of index terms.

The use of these statistical methods is valuable because they
consider the connections and relationships among topics which are
particular to the given document collection. In addition, they not only
state whether an index term is related to another one, but also provide
an estimate of how closely related it is.

In the second edition of Centralization and Documentation [1],
Arthur D. Little, Inc. states that the capability to determine measures
of association among index terms and documents leads to the potential
ability to design systems possessing the following highly desirable
capabilities, which are not available in many existing coordinate
searching systems:

1.  A  capability  for  automatically  generalizing  a  user's 
request  to  make  it  more  compatible  with  the  vocabulary 
of  the  retrieval  system. 

2.  A capability for automatically matching the user's
depth-of-search requirement to system parameters, by
ranking the documents presented to users in decreasing
order of probable relevance.

3.  A potential capability for nearly instantaneous
interaction between user and searching machine, without the
need for a human intermediary.


The Center for the Information Sciences automatic document retrieval
system, although making use of an associative search technique, does
not fully exploit these capabilities.

The first capability exists in part, but manual intervention is
necessary before the generalized request can be used. A procedure has
been implemented [2] which presents the user with a list of index terms
associated with the documents retrieved by his search request, the
number of times each of these terms has been used in indexing the
documents in the collection, the number of times these terms index the
retrieved documents, and an association coefficient for each term based
on the results of the search. Using this information, as well as any
familiarity he might have with the contents of the collection and with
the general indexing strategy, the user is able to estimate with
reasonable success how many documents he would receive if he were to
conduct a proposed search. Employing this technique, he is able in
many cases to optimize his retrieval results by manually reformulating
his original request.

The second capability does not exist at all in the CIS system,
although a ranked list of the documents retrieved would be beneficial to
the user, especially when his initial search request has produced more
documents than he can use.

The third capability is satisfied in the CIS system when the
searching is performed in an on-line environment.

The lack of fulfillment of the first two capabilities, together
with the success of existing associative methods and the call for more
experiments on associative searching [3], led to the work described in
this report. It was felt that by supplementing the existing
associative procedures with an effective automatic technique, the full
potential offered by the use of associative searching in the CIS system
could be more nearly achieved. The user would then have the option of
attempting to optimize his retrieval results through either
semi-automatic or fully automatic means.

B.  Design of Experiment


In performing the experiment, the CIS associative search program
was utilized. For each associative search, this program produces a
term profile, a list of index terms associated with the search term.
The association between an index term and the search term is calculated
as follows:

$$\text{ASSOCIATIVITY COEFFICIENT} = \frac{[f(ab)]^2}{f(a) \times f(b)}$$

where

f(a) = total occurrence of the profile term
in the document collection

f(b) = total occurrence of the search term in
the document collection

f(ab) = co-occurrence of the profile term with
the search term.

The term profile for the search term "ARTIFICIAL INTELLIGENCE" is given
in Figure 1.
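As an illustration only, the calculation can be written in a few lines of Python. The sketch assumes the coefficient is the squared co-occurrence divided by the product of the two total occurrences, a form that reproduces the values printed in Figure 1. The function name `associativity` is hypothetical and is not part of the CIS program.

```python
def associativity(f_a: int, f_b: int, f_ab: int) -> float:
    """Associativity coefficient, assumed to be [f(ab)]^2 / (f(a) * f(b)).

    f_a  -- total occurrence of the profile term in the collection
    f_b  -- total occurrence of the search term in the collection
    f_ab -- co-occurrence of the profile term with the search term
    """
    if f_a <= 0 or f_b <= 0:
        raise ValueError("total occurrences must be positive")
    if f_ab < 0 or f_ab > min(f_a, f_b):
        raise ValueError("co-occurrence cannot exceed either total occurrence")
    return (f_ab * f_ab) / (f_a * f_b)


# "learning" against "ARTIFICIAL INTELLIGENCE" in Figure 1:
# total occurrence 20, co-occurrence 5, search term occurrence 24.
learning = associativity(f_a=20, f_b=24, f_ab=5)
print(round(learning, 4))
```

A term that always co-occurs with the search term and never appears elsewhere receives a coefficient of 1, which is why the search term itself heads its own profile with 1.0000.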

When the search is composed of two or more terms linked by the
logical Boolean operations of disjunction (v), conjunction (+), and
negation (-), the associativity coefficient is still calculated as
though the search were composed of one term. That is, f(b) is equal to
the number of documents retrieved by the search and f(ab) is the
co-occurrence of the profile term in those documents. The term profile
for the search "DISSEMINATION + INFORMATION" is given in Figure 2.

Eight  documents  are  retrieved  by  this  search  statement. 
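The treatment of a Boolean search as a single composite term can be sketched as follows. The toy index, the document numbers, and the helper `term_profile` are assumptions made for illustration; only the rule that f(b) is the number of documents retrieved comes from the text.

```python
from collections import Counter


def term_profile(index, retrieved):
    """Build a term profile for an arbitrary set of retrieved documents.

    index     -- dict mapping document id to the set of its index terms
    retrieved -- set of document ids produced by the Boolean search
    """
    total = Counter(t for terms in index.values() for t in terms)  # f(a)
    co_occ = Counter(t for doc in retrieved for t in index[doc])   # f(ab)
    f_b = len(retrieved)                                           # f(b)
    return {t: co_occ[t] ** 2 / (total[t] * f_b) for t in co_occ}


# Hypothetical four-document collection.
index = {
    1: {"dissemination", "information", "costs"},
    2: {"dissemination", "information"},
    3: {"dissemination", "periodicals"},
    4: {"information", "costs"},
}

# Conjunction: both terms must index the document.
request = {"dissemination", "information"}
retrieved = {doc for doc, terms in index.items() if request <= terms}

profile = term_profile(index, retrieved)
for term, coeff in sorted(profile.items()):
    print(f"{term:15s} {coeff:.4f}")
```

Terms that index none of the retrieved documents never enter the profile, so the profile is always limited to the vocabulary of the initial output.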
TOT. OCC.  CO-OCC.  TERM                        ASSOC. COEFF.

    2         1     adaptive                       0.0208
    3         1     ALGOL                          0.0139
   12         2     answers                        0.0139
   24        24     artificial intelligence        1.0000
   76         1     associat(ion, ive)             0.0005
   14         2     automata                       0.0119
  116         1     automatic                      0.0004
   47         4     behavior                       0.0142
   63         2     bibliography                   0.0026
   11         1     command and control            0.0038
   75         2     communication                  0.0022
  187        11     computer                       0.0270
   29         2     concept                        0.0057
   12         1     context                        0.0035
   11         2     control                        0.0152
   16         4     cybernetics                    0.0417
   33         2     decision                       0.0051
   36         1     engineering                    0.0012
   19         1     flow                           0.0022
    6         1     game                           0.0052
    4         1     geometr(y, ical, ic)           0.0104
   87         2     grammar                        0.0019
   13         2     heuristic                      0.0128
  183         1     information                    0.0002
   69         3     language, artificial           0.0054
  137         5     language, natural              0.0076
   20         5     learning                       0.0521
  131         1     linguistics                    0.0003
   10         1     list processing                0.0042
   78         3     logic                          0.0048
   73         3     machine                        0.0051
   35         1     man                            0.0012
   46         1     management                     0.0009
    4         1     Markov                         0.0104
   55         2     mathematic(s, al)              0.0030
   36         1     memory                         0.0012
   77         4     models                         0.0087
    3         1     neuron                         0.0139
    5         2     neuropathology                 0.0333
   16         2     pattern                        0.0104
   21         6     problem-solving                0.0714
   84         4     program(med, ming)             0.0079
   15         1     Project MAC                    0.0028
   21         1     psychology                     0.0020
   37         2     questions                      0.0045
   22         2     recognition                    0.0076
    5         1     recursive                      0.0083
  151         2     retrieval                      0.0011
   73         2     review                         0.0023
   17         2     self-organiz(ation, ing)       0.0098
   48         3     semantic(s)                    0.0078
   25         1     simulation                     0.0017
   47         1     statistic(s, al)               0.0009
   77         1     syntax                         0.0005
  248         2     systems                        0.0007
    7         1     teaching                       0.0060
  105         4     theory                         0.0063
    6         2     thinking                       0.0278
   31         1     transformations                0.0013

Figure 1.  Term Profile for ARTIFICIAL INTELLIGENCE
TOT. OCC.  TERM                                        ASSOC. COEFF.

   46      behavior                                       0.0027
   24      biology                                        0.0208
   25      center                                         0.0450
  134      chemistry                                      0.0009
   73      communication                                  0.0017
   46      costs                                          0.0245
   71      design                                         0.0010
   28      dissemination                                  0.2857
   32      document                                       0.0039
   34      documentation                                  0.0331
   19      education (training)                           0.0066
   34      efficiency                                     0.0037
  109      evaluation                                     0.0046
   19      flow                                           0.0066
    7      foreign                                        0.0714
   24      format                                         0.0052
    6      government                                     0.0208
   73      index                                          0.0017
  172      information                                    0.0465
   13      input                                          0.0095
   75      librar(ies, y)                                 0.0067
   11      liter(acy, ature)                              0.0114
   45      management                                     0.0111
   29      medi(cal, cine)                                0.0043
   23      micro(form, film, card, image, fiche)          0.0054
    4      operations                                     0.0312
   49      organization                                   0.0230
   25      patents                                        0.0050
   31      periodicals                                    0.0363
   21      psychology                                     0.0238
   19      publication                                    0.0263
   76      punched cards                                  0.0016
   15      questionnaire                                  0.0083
    4      report                                         0.0312
   25      requirements                                   0.0050
  150      retrieval                                      0.0033
   73      review                                         0.0068
   53      scien(ce, tific)                               0.0212
    2      Science Information Exchange                   0.0625
   29      scientists                                     0.0043
   11      standards                                      0.0114
   44      storage                                        0.0028
  241      systems                                        0.0021
    7      teaching                                       0.0179
   49      use                                            0.0026

[The CO-OCC. column values are not legible in the source.]

Figure 2.  Term Profile for DISSEMINATION + INFORMATION
The associativity formula corresponds to the product of the
conditional probabilities that, given a document which one term indexes,
the other term is also used to index it. The formula was specifically
suggested for use in the system by Stiles, and should not be confused with
his formula based on the chi-square distribution, which he uses to
calculate an association factor between pairs of index terms.
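Written out, the correspondence is as follows, taking f(ab)/f(a) as the estimated probability that a document indexed by the profile term is also indexed by the search term, and f(ab)/f(b) as the converse. This reading is an interpretation of the sentence above rather than a derivation given in the report; it does agree with the printed coefficients.

```latex
\frac{f(ab)}{f(a)} \times \frac{f(ab)}{f(b)}
  \;=\; \frac{[f(ab)]^{2}}{f(a)\, f(b)}
\qquad\text{e.g. problem-solving in Figure 1:}\quad
\frac{6}{21} \times \frac{6}{24} \;=\; \frac{36}{504} \;\approx\; 0.0714
```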

In the retrieval technique originally used in the experiment, the
association coefficients served as weights for the profile terms in a
second search of the document collection. The list of profile terms
was compared with the index terms of the documents in the collection
and the weights of the terms that matched were summed, resulting in a
relevance number for each document. This document relevance number was
used to present the documents in the order of their probable relevance
to the request. Many of the concepts providing the theoretical
foundation for this technique were expressed by Maron and Kuhns [4]. The
notion of summing term associativity coefficients to produce document
relevance numbers was shown to be effective on an existing collection
of documents by Stiles [5], but a different method for calculating the
weights of the associated terms was employed.
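The summation step can be sketched as below. The coefficients are taken from Figure 2; the three-document collection, the document numbers, and the function names are hypothetical and serve only to show the mechanics.

```python
def relevance_number(doc_terms, profile):
    """Sum the coefficients of the profile terms that index the document."""
    return sum(profile[t] for t in doc_terms if t in profile)


def rank_documents(index, profile):
    """Return (doc id, relevance number) pairs, most relevant first.

    Documents sharing no term with the profile are left out.
    """
    scored = ((doc, relevance_number(terms, profile))
              for doc, terms in index.items())
    return sorted((p for p in scored if p[1] > 0),
                  key=lambda p: (-p[1], p[0]))


# A few coefficients from the DISSEMINATION + INFORMATION profile (Figure 2).
profile = {
    "dissemination": 0.2857,
    "information": 0.0465,
    "psychology": 0.0238,
    "scien-": 0.0212,
}

# Hypothetical collection of three documents.
index = {
    101: ["dissemination", "information", "psychology", "scien-"],
    102: ["dissemination", "information"],
    103: ["retrieval"],
}

ranking = rank_documents(index, profile)
print([(doc, round(score, 4)) for doc, score in ranking])
```

Document 101 carries the same four terms as one of the documents in the experiment and receives the same sum, 0.3772, that appears in Figure 4.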

To automatically expand the original output, the second pass was
performed on the entire document collection. This was intended to
enable the retrieval system to locate documents relevant to the original
request even though those documents had not been indexed by the terms
in the request. To automatically narrow the original output, only the
documents retrieved by the initial request were considered. This was
based on the assumptions that enough potentially relevant documents had
already been retrieved and that the ranking method given above would
present the documents most relevant to the request at the head of the
list.
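The two modes differ only in which documents take part in the second pass, as the following sketch suggests. The names `expand` and `narrow`, the four-document collection, and the lettered document identifiers are assumptions for illustration; the weights are a few coefficients from Figure 2.

```python
def score(terms, profile):
    """Sum of the profile weights of the terms indexing one document."""
    return sum(profile.get(t, 0.0) for t in terms)


def ranked(doc_ids, index, profile):
    """Rank the given documents by decreasing score, dropping zero scores."""
    scored = [(d, score(index[d], profile)) for d in doc_ids]
    return sorted((p for p in scored if p[1] > 0),
                  key=lambda p: (-p[1], p[0]))


def expand(index, profile):
    """Second pass over the entire collection."""
    return ranked(index.keys(), index, profile)


def narrow(index, profile, retrieved):
    """Second pass over the initially retrieved documents only."""
    return ranked(retrieved, index, profile)


index = {
    "A": {"dissemination", "information", "costs"},
    "B": {"dissemination", "information"},
    "C": {"dissemination", "periodicals"},
    "D": {"storage"},
}
profile = {"dissemination": 0.2857, "information": 0.0465,
           "costs": 0.0245, "periodicals": 0.0363}
retrieved = {"A", "B"}  # output of the initial Boolean search

expanded_ids = [d for d, _ in expand(index, profile)]
narrowed_ids = [d for d, _ in narrow(index, profile, retrieved)]
print(expanded_ids, narrowed_ids)
```

In this toy case document C is reached by expansion although it lacks the term "information", while narrowing merely reorders the two documents already retrieved.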


The experiment attempted to answer three basic questions:

1.  Can the automatic technique effectively expand the
original search output?

2.  Can the automatic technique effectively narrow the
original search output?

3.  Assuming an affirmative answer to the first two
questions, can a threshold level be applied to the
associativity coefficients which reduces the size of
the term profile without influencing the results of
the automatic techniques?

The basis for this experiment was the document collection of the
Center for the Information Sciences at Lehigh University. This
collection contains approximately 2,500 documents, which are manually
indexed using a thesaurus composed of 450 index terms. A total of ten
searches were performed on the document collection and provide the data
on which the results are based.

In designing the experiment, the intent was to select a large
enough sample of initial Boolean search statements to lend an
acceptable degree of validity to the results. The sample size was
dictated, to a certain extent, by the practical restrictions imposed by
economics and time, and by the very real consideration of the absence of
a suitable definition for "a large enough sample." Despite the
inevitable small sample size, it was felt that a certain amount of
insight could be achieved through the selection of a realistic and
representative group of search statements. The other limitation
affecting the results of the experiment was the use of an intuitive
basis for evaluation. This was born out of necessity, however, as no
suitable alternative was apparent.

C.  Experimental  Results 

Table 2 in the Appendix gives the results obtained from the initial
search statement, "DISSEMINATION + INFORMATION," which retrieves every
document in the collection indexed by the terms "dissemination" and
"information." This request produces eight documents, which are
denoted in Table 2 by an asterisk (*), and the term profile presented
earlier in Figure 2. The remaining documents in Table 2 are produced
by applying the associativity coefficients from the term profile to the
automatic expansion procedure. The resulting highest document
relevance numbers are shown in Table 2, along with the ranks and titles of
the documents. It should be noted that no significance is attached to
the document relevance number except as a basis for ranking the
documents.
In looking at the documents produced by the expansion technique,
there appear to be several documents which can be intuitively judged as
highly related to the request, and therefore, of potential interest to
the user who wants to examine more than the original eight. However,
the search terms chosen were not the precise terms originally used to
index these documents and, as a result, they were not retrieved. This
is a very real problem in a coordinate indexing system using several
hundred terms to index documents on various aspects of a particular
subject. The indexer tries to use language he hopes will be used by
future requesters and, to a certain extent, the user tries to use the
language he thinks the indexer used, but in many cases they don't
settle on the same set of terms.

Although the results illustrated by Figure 3 are encouraging, it
must be remembered that the expansion was based on the indexing of
eight documents. Since the validity of the information contained in
the term profile increases with the amount of data available to
generate it, the ability to automatically expand a search retrieving only
one or two documents is questioned. In an operational retrieval system,
expansion is probably most desirable when the original output is small,
so to be of any practical value, the expansion technique must be
effective when very few documents are initially retrieved.


TOT. OCC.  CO-OCC.  TERM                        ASSOC. COEFF.

   24         1     artificial intelligence        0.0417
   13         1     automata                       0.0769
  182         1     computer                       0.0055
   11         1     control                        0.0909
   19         1     learning                       0.0526
   54         1     mathematic(s, al)              0.0185
   77         1     models                         0.0130
   15         1     pattern                        0.0667
   22         1     recognition                    0.0455
   17         1     self-organiz(ation, ing)       0.0588
  237         1     systems                        0.0042
  103         1     theory                         0.0097

Figure 3.  Term Profile for LEARNING + SYSTEMS
Table 3 in the Appendix represents the expansion of "LEARNING +
SYSTEMS" based on the term profile shown in Figure 3. This request
produced only one document, but the expansion appears to have produced
some potentially relevant documents. It should be noted, however, that
the document originally retrieved was indexed by twelve terms, slightly
more than twice the average in the CIS document collection. This
caused the generation of a larger term profile than would normally be
the case when the original output is only one document. A smaller term
profile could have resulted in less satisfactory expansion results.

Another problem caused by small initial output becomes apparent
upon closer scrutiny of Figure 3. When a term profile is generated
from a small number of documents, the formula for calculating
associativity coefficients appears to place the index term that is quite
frequently used in the total collection at an unfair disadvantage. For
example, the term "systems" co-occurred in the maximum number of
original documents, but received the lowest associativity coefficient
because its total occurrence in the document collection was the highest
of the twelve associated terms. However, it doesn't appear to be
detrimental to the results of the expansion in this case.
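The effect follows directly from the arithmetic, assuming the coefficient has the form of squared co-occurrence over the product of total occurrences. With a single retrieved document, f(b) = 1 and every profile term has f(ab) = 1, so the coefficient collapses to the reciprocal of the term's total occurrence:

```latex
\frac{[f(ab)]^{2}}{f(a)\, f(b)} \;=\; \frac{1^{2}}{f(a) \cdot 1} \;=\; \frac{1}{f(a)}
\qquad\Longrightarrow\qquad
\text{systems: } \frac{1}{237} \approx 0.0042,
\quad
\text{control: } \frac{1}{11} \approx 0.0909
```

Both values agree with Figure 3: the ranking of profile terms in this case is simply the inverse of their frequency of use in the collection.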

In attempting to narrow the search automatically, the assumption is
made that all the documents originally retrieved are potentially
relevant to the user's request. It remains for the automatic technique to
present these documents in their order of relevance to the request so
the user can halt the retrieval process when he has enough documents to
work with. The problem lies in determining the order of relevance of
the documents to the request.

The  technique  was  used  to  determine  the  order  of  relevance  of  the 
twenty-four  documents  retrieved  by  the  search  "ARTIFICIAL  INTELLIGENCE." 
The  term  profile  generated  for  this  search  is  given  in  Figure  1  and  the 
results  of  the  document  ordering  are  shown  in  Table  1  in  the  Appendix. 

It is obvious from this example that when a summation of associativity
coefficients is all that is involved in producing document relevance
numbers, the number of terms used to index a document becomes an
extremely influential parameter. Since the term profile is composed of
the terms indexing the documents originally retrieved, an increase in
the number of index terms for one of these documents can only result in
an increase in the relevance number for that document. This
relationship becomes quite significant when the document collection has
been manually indexed, due to the well-known inconsistencies introduced
into a system through the use of human indexers. Such is the case in the
CIS system, where not one, but several indexers have been employed.
Based on this shortcoming, it is evident that a modification must be
made to the original automatic technique before it can effectively
narrow a search.

It  should  be  pointed  out  that  the  problem  explained  above  has  no 
bearing  on  the  automatic  expansion  process  examined  earlier,  as  the 
concern  there  is  with  the  documents  not  retrieved  by  the  original 
search  statement. 

In spite of the unsatisfactory results involving the rankings
produced by the document relevance formula, it was decided to continue to
experiment with the use of a cut-off value for associativity
coefficients. This called for using only profile terms with associativity
coefficients greater than some predetermined threshold to calculate the
document relevance number. The threshold chosen for the experiment was
0.0125. There was nothing magic about that particular figure. It was
chosen because it had been used previously in the CIS on-line
associative search program and generally produced a term profile of from
ten to fifteen terms, which was considered a good number to work with for
this experiment.
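Applying the cut-off amounts to a simple filter on the term profile. The sketch below uses the coefficients printed in Figure 3; the function name `apply_cut_off` is hypothetical, and the strict comparison reflects the wording "greater than" used above.

```python
CUT_OFF = 0.0125  # threshold chosen for the experiment


def apply_cut_off(profile, threshold=CUT_OFF):
    """Keep only profile terms whose coefficient exceeds the threshold."""
    return {term: c for term, c in profile.items() if c > threshold}


# Term profile for LEARNING + SYSTEMS (Figure 3).
profile = {
    "artificial intelligence": 0.0417,
    "automata": 0.0769,
    "computer": 0.0055,
    "control": 0.0909,
    "learning": 0.0526,
    "mathematic(s, al)": 0.0185,
    "models": 0.0130,
    "pattern": 0.0667,
    "recognition": 0.0455,
    "self-organiz(ation, ing)": 0.0588,
    "systems": 0.0042,
    "theory": 0.0097,
}

reduced = apply_cut_off(profile)
print(len(profile), "->", len(reduced))
print(sorted(set(profile) - set(reduced)))
```

For this profile the filter removes the three most heavily used terms and leaves the remaining nine untouched.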

The use of the cut-off does not drastically affect the document
rankings in any of the searches tested. This can be seen by the
comparisons given of "DISSEMINATION + INFORMATION" and "ARTIFICIAL
INTELLIGENCE" in Figure 4. The conclusions are that: 1) the highest
document relevance numbers are primarily the result of a few terms with
high associativity coefficients rather than several terms with low
associativity coefficients, and 2) the weights of the terms deleted by
the cut-off are not significant enough to cause any appreciable
fluctuation in the document relevance numbers.
      DISSEMINATION + INFORMATION                  ARTIFICIAL INTELLIGENCE

ORIG.  ORIG.         C.-O.  CUT-OFF         ORIG.  ORIG.         C.-O.  CUT-OFF
RANK   DOC.REL.NO.   RANK   DOC.REL.NO.     RANK   DOC.REL.NO.   RANK   DOC.REL.NO.

  1    0.6320          1    0.5863            1    1.1858          1    1.1541
  2    0.6106          2    0.5800            2    1.1527          8    1.0943
  3    0.5413          3    0.4790            3    1.1467          2    1.1363
  4    0.4669          5    0.4314            4    1.1419          4    1.1123
  5    0.4563          4    0.4498            5    1.1339          "      "
  6    0.4232          7    0.4172            6    1.1264          6    1.1107
  7    0.4192          6    0.4192            7    1.1254          3    1.1254
  8    0.3772          8    0.3772            8    1.1086          9    1.0933
  9    0.3409          9    0.3322            9    1.1021         10    1.0799
 10    0.3350         10    0.3307           10    1.0984          7    1.0984
 11    0.3255         11    0.3220           11    1.0916         11    1.0714
 12    0.3155         12    0.3095           12    1.0596         16    1.0270
 13    0.3006         13    0.2857           13    1.0568         12    1.0559
 14    0.2968          "      "              14    1.0521         13    1.0417
 15    0.2965          "      "              15    1.0514         14    1.0409
 16    0.2939          "      "              16    1.0490         18    1.0208
 17    0.2911          "      "              17    1.0482         15    1.0333
 18    0.2903          "      "              18    1.0327         16    1.0270
 19    0.2892          "      "              19    1.0318         21    1.0000
 20    0.2890          "      "              20    1.0275         19    1.0152
 21    0.2878          "      "              21    1.0251         20    1.0139
 22    0.2866          "      "              22    1.0194         21    1.0000
 23    0.2857          "      "              23    1.0026          "      "
  "      "             "      "              24    1.0000          "      "
  "      "             "      "
  "      "             "      "
  "      "             "      "
  "      "             "      "
 29    0.1813         29    0.1702
 30    0.1503         30    0.1278

Figure 4

This  implies  that  the  automatic  technique  being  tested  will  still 
be  effective  in  expanding  a  search  when  the  cut-off  is  used.  However, 
it  remains  to  find  a  suitable  modification  to  the  document  relevance 
formula  before  any  degree  of  success  is  achieved  in  automatically 
narrowing  a  search.  In  both  cases,  the  use  of  a  cut-off  will  result  in 
a  considerable  savings  in  computer  time,  owing  to  the  sizable  reduction 
in  the  number  of  index  terms  used. 
It had been pointed out earlier that the modification to the
document relevance formula was necessary to neutralize the inconsistencies
introduced by the indexing process. Of primary concern was the
depth-of-indexing variation among indexers. With only a summation of term
associativity coefficients determining the degree of relevance of the
retrieved documents to the original request, it was feared that some
documents might achieve unwarranted high rankings because one indexer
typically assigned more index terms than the others.

After much experimentation, it was determined to use the following
modification for calculating the new relevance number for a document:

$$\text{DOCUMENT RELEVANCE NUMBER} = \frac{S \times N}{T}$$

where

S = document relevance number as calculated previously,
but using only terms with associativity coefficients
> 0.0125 in the summation.

N = number of terms with associativity coefficients
> 0.0125 indexing the document.

T = total number of terms indexing the document.
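A minimal sketch of the modified calculation is given below, assuming the relevance number has the form S × N / T, which is the form that reproduces the values printed in Figure 5. The coefficients come from Figure 1. The first document uses a term list shown in Figure 5; the three below-threshold terms of the second document are hypothetical stand-ins, since Figure 5 lists only the terms that survive the cut-off.

```python
CUT_OFF = 0.0125  # threshold used in the experiment


def modified_relevance(doc_terms, profile, cut_off=CUT_OFF):
    """Return S * N / T for one document (assumed form of the new formula)."""
    significant = [profile[t] for t in doc_terms
                   if profile.get(t, 0.0) > cut_off]
    s = sum(significant)   # S: sum over terms that survive the cut-off
    n = len(significant)   # N: number of such terms
    t = len(doc_terms)     # T: total number of index terms
    return s * n / t if t else 0.0


profile = {  # coefficients from Figure 1
    "artificial intelligence": 1.0000, "computer": 0.0270,
    "behavior": 0.0142, "problem-solving": 0.0714, "heuristic": 0.0128,
    "logic": 0.0048, "machine": 0.0051, "theory": 0.0063,
}

# Five index terms, all above the cut-off (N = T = 5).
doc_a = ["artificial intelligence", "computer", "behavior",
         "problem-solving", "heuristic"]

# Five index terms, two above the cut-off (N = 2, T = 5).
doc_b = ["artificial intelligence", "problem-solving",
         "logic", "machine", "theory"]

print(round(modified_relevance(doc_a, profile), 4))
print(round(modified_relevance(doc_b, profile), 4))
```

The factor N / T is the fraction of a document's index terms that are "significant", so a heavily indexed document gains nothing from terms that fall below the cut-off.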

The effect of this formula can be seen in Figure 5. This shows the
new ordering of the documents retrieved by "ARTIFICIAL INTELLIGENCE,"
and compares it to the ranking produced for the same search by the
original and cut-off methods presented earlier. The total number of
index terms and the index terms with associativity coefficients ≥ 0.0125
are also given for each document. In the latter column, "a.i." is used
for the index term "artificial intelligence." The titles of the
documents were given in Table 1 and the term profile for the search
"ARTIFICIAL INTELLIGENCE" was shown in Figure 1.

While the cut-off method produced no major deviation from the
original results, it is apparent that the new document relevance number
formula has. The documents heading the list are generally those indexed
by the greatest percentage of "significant" index terms (those which
survived the cut-off). At the same time, the relationships among index
terms, provided by the associativity coefficients, remain an important
parameter and still influence the final document ranking. With the new
formula, a document is not penalized for having a greater than average
number of index terms if those terms are "significant."
                        ARTIFICIAL INTELLIGENCE

ORIG.  C.-O.  NEW    NEW           NO.
RANK   RANK   RANK   DOC.REL.NO.   TERMS   TERMS ≥ 0.0125

  1      1      9    0.5775         10     a.i., learning, neuropathology,
                                           computer, cybernetics
  2      8     19    0.3648         12     a.i., control, computer, learning
  3      2      4    0.9092          5     a.i., heuristic, learning,
                                           problem-solving
  4      4     12    0.4944          9     a.i., problem-solving, answers,
                                           computer
  5      4     12    0.4944          9     a.i., problem-solving, answers,
                                           computer
  6      6     10    0.5554         10     a.i., behavior, computer,
                                           cybernetics, thinking
  7      3      1    1.1254          5     a.i., computer, behavior,
                                           problem-solving, heuristic
  8      9      8    0.6248          7     a.i., behavior, learning, computer
  9     10     14    0.4623          7     a.i., thinking, learning
 10      7      2    1.0984          3     a.i., problem-solving, computer
 11     11     15    0.4286          5     a.i., problem-solving
 12     16     23    0.2054         10     a.i., computer
 13     12      5    0.7920          4     a.i., cybernetics, behavior
 14     13      6    0.6944          3     a.i., cybernetics
 15     14     20    0.3093          8     a.i., computer, ALGOL
 16     18     21    0.2552          8     a.i., adaptive
 17     15     16    0.4134          5     a.i., neuropathology
 18     16      7    0.6846          3     a.i., computer
 19     21     24    0.1429          7     a.i.
 20     19     17    0.4060          5     a.i., control
 21     20     18    0.4056          5     a.i., neuron
 22     21     22    0.2500          4     a.i.
 23     21     11    0.5000          2     a.i.
 24     21      3    1.0000          1     a.i.

Figure 5


The use of this formula provides a major improvement in the
rankings of the documents and therefore in the ability to automatically
narrow the search. It should be noted that, for purposes of narrowing
the search, this formula could not be used without the existence of a
cut-off, since every index term in the documents originally retrieved
appears in the term profile. So the cut-off is not merely important in
saving computer time, it is mandatory if the automatic narrowing
technique is to function properly.
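The reason can be stated in one line, assuming the modified relevance number has the form S × N / T. Without a cut-off, every term indexing a retrieved document counts as a profile term, so N equals T and the correction factor vanishes:

```latex
N = T
\quad\Longrightarrow\quad
\frac{S \times N}{T} \;=\; \frac{S \times T}{T} \;=\; S
```

The new formula would then reproduce the simple summation, together with its bias toward heavily indexed documents.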

Since it would be more convenient to use the same document
relevance formula in both the automatic expansion and narrowing techniques,
the new formula was tested with the automatic expansion technique,
despite the fact that the technique had already proven to be effective.

Figure 6 illustrates the type of results produced by the use of the
new document relevance number formula with the automatic expansion
technique. The headings for this figure are identical to those used in
Figure 5. The documents originally retrieved are denoted by an
asterisk (*) before the index terms. The original expansion of the search
"DISSEMINATION + INFORMATION" was given in Table 2 and the term profile
was shown in Figure 2.

DISSEMINATION + INFORMATION

ORIG.  C.-O.  NEW    NEW          NO.
RANK   RANK   RANK   DOC.REL.NO.  TERMS  TERMS ≥ 0.0125

                     0.2932       18     *dissemination, information, foreign,
                                         biology, periodicals, center,
                                         publication, documentation, scien-
                     0.3988       16     *dissemination, information, teaching,
                                         biology, librar-, costs, organizat-,
                                         documentation, patents, foreign,
                                         government
                     0.2579       13     *dissemination, information, scien-,
                                         organization, center, costs,
                                         documentation
                     0.1327       13     *dissemination, information,
                                         operations, biology
                     0.3374        8     *dissemination, information,
                                         periodicals, psychology,
                                         publication, report
                     0.2980        7     dissemination, foreign, periodicals,
                                         psychology
                     0.4192        4     *dissemination, information,
                                         science info exchange, costs
                     0.3772        4     *dissemination, information,
                                         psychology, scien-
                     0.1329        5     *dissemination, information
                     0.2480        4     dissemination, scien-, psychology
                     0.1288        5     dissemination, periodicals
                     0.1857        5     dissemination, psychology,
                                         science info exchange
                     0.0319        9     dissemination
                     0.0952        3     dissemination
                     0.0408        7     dissemination
                     0.0952        3     dissemination
                     0.0952        3     dissemination
                     0.0714        4     dissemination
                     0.0714        4     dissemination
                     0.0952        3     dissemination
                     0.0952        3     dissemination
                     0.1429        2     dissemination
                     0.2857        1     dissemination
                     0.0952        3     dissemination
                     0.1429        2     dissemination
                     0.0714        4     dissemination
                     0.1429        2     dissemination
                     0.0714        4     dissemination
                     0.1418        6     information, organization, center,
                                         costs, operations
                     0.0383       10     information, periodicals, center


Figure 6

The change in rankings shown in Figure 6 is not too pronounced. The
documents retrieved by the original expansion were also retrieved
using the new formula, so in that sense the automatic expansion
technique is still effective, because it still provides the user with
potentially relevant documents he missed with his original search.

Since the problems encountered with the automatic narrowing technique
have been solved by the change in the document relevance number
formula, and the automatic expansion technique, for all practical
purposes, performs as well with that formula as it did with those
previously tested, it was determined that a satisfactory method had
been found to both automatically expand and automatically narrow the
output from the initial search statement.

In  summarizing,  the  basic  steps  in  the  proposed  automatic  retrieval 
method  for  the  CIS  system  are: 

1. Generate a term profile from the user's initial Boolean
   search statement using the existing formula to calculate
   the term associativity coefficients.

2. Using only terms with associativity coefficients
   ≥ 0.0125, compare the list of profile terms with the
   index terms of each document and add the associativity
   coefficients of the terms that match. The sum, S, of
   the weights is used to calculate the document relevance
   number. To expand the original search, perform this
   step for every document in the collection. To narrow
   the original search, use only the documents originally
   retrieved.

3. For each document, multiply S from step 2 by the
   number of terms with associativity coefficients
   ≥ 0.0125 indexing that document. Then divide the
   product by the total number of terms indexing the
   document. The result is the document relevance number.

4. Present the documents to the user in the order of their
   probable relevance to his request.
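
The four steps above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the term profile is taken
to be a mapping from term to associativity coefficient (step 1 is
assumed already done), each document is a list of its index terms, and
the names relevance_number and rank are hypothetical, not part of the
CIS programs.

```python
from typing import Dict, Iterable, List, Tuple

CUTOFF = 0.0125  # associativity cut-off quoted in the text


def relevance_number(profile: Dict[str, float],
                     doc_terms: Iterable[str],
                     cutoff: float = CUTOFF) -> float:
    """Steps 2 and 3: S * (matching terms) / (all terms indexing the document)."""
    terms = set(doc_terms)
    if not terms:
        return 0.0
    # Profile terms at or above the cut-off that also index this document.
    matched = [t for t in terms if profile.get(t, 0.0) >= cutoff]
    s = sum(profile[t] for t in matched)  # the sum S of step 2
    return s * len(matched) / len(terms)


def rank(profile: Dict[str, float],
         documents: Dict[str, List[str]],
         cutoff: float = CUTOFF) -> List[Tuple[str, float]]:
    """Step 4: documents in descending order of relevance number.

    Pass the whole collection to expand a search, or only the documents
    originally retrieved to narrow it.
    """
    scored = [(doc_id, relevance_number(profile, terms, cutoff))
              for doc_id, terms in documents.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Illustrative coefficients and documents (assumed values, not CIS data).
profile = {"dissemination": 0.2857, "information": 0.0465}
docs = {"d1": ["dissemination", "x", "y"],
        "d2": ["dissemination", "information"]}
ranking = rank(profile, docs)
```

A document indexed by many terms, few of which appear in the profile,
is pushed down the list by the final division, which is what allows the
same formula to serve for narrowing.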


D.  Conclusion 


The CIS document retrieval system must be prepared to serve many
different types of users, each of whom may have different needs.
Because of this, it is unreasonable to expect a one-pass search process
to satisfy all user classes. This problem has been alleviated somewhat
in the system by the introduction of a semi-automatic multiple-pass
search strategy. This procedure leaves the search reformulation in the
user's hands and allows for different adjustments from user to user,
depending on individual needs. However, there will still be times when
the user is unable to optimize his search results by reformulating his
own request. When this occurs, his needs may best be served by a fully
automatic retrieval strategy in which his only function is to criticize
the initial search as being too narrow or too broad. The results of
this experiment indicate that the proposed automatic retrieval
technique could be successful in that exact situation.

Another problem has been created in the CIS system by the addition
of new index terms to the thesaurus. When a new term is added, it is
not feasible to re-index the entire document collection based on that
term. The addition of a new term could result in documents not being
indexed by that term which, in fact, should be. Consequently, the
possibility exists of potentially relevant documents not being
retrieved when the new term is used in a search statement. It is
believed the automatic expansion technique described in this paper
could retrieve many of these documents, although not enough
experimentation was performed to present any valid evidence to that
effect.

The proposed technique lends itself quite easily to an on-line
environment. Although the search procedure itself is completely
automatic, the user maintains control with the capability of halting
the operation any time he has enough documents to satisfy his needs. A
conversational mode could be developed to offer the user a choice among
a non-associative search, a semi-automatic associative search, and a
completely automatic associative search. The first decision could be
between non-associative and associative search. If the latter option
was taken, a second decision would have to be made between the two
types of associative searches.

In an off-line, batched processing environment, the proposed
technique could still operate if the user knew prior to the run how
many documents he wanted to retrieve. After the initial search had been
completed, a comparison would be made between the number of documents
retrieved and the number of documents the user wanted to retrieve.
Based on this comparison, the program would either automatically expand
the original search, automatically narrow the original search, or stop
searching altogether. The final operation would be the presentation of
the exact number of documents the user specified in the order of their
relevance to his initial request. Since no manual intervention is
required, the user would not incur a delay in the processing of his
search request.
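
The batched decision just described can be sketched as follows. The
sketch assumes the two ranked lists (the whole collection and the
originally retrieved documents, each already ordered by relevance
number) are supplied by the caller; the function names are hypothetical.

```python
from typing import List, Sequence


def choose_action(retrieved: int, wanted: int) -> str:
    """Compare the number retrieved with the number the user asked for."""
    if retrieved < wanted:
        return "expand"
    if retrieved > wanted:
        return "narrow"
    return "stop"


def batch_select(initial: Sequence[str], wanted: int,
                 ranked_collection: Sequence[str],
                 ranked_initial: Sequence[str]) -> List[str]:
    """Return the documents to present, in relevance order.

    ranked_collection: every document, ordered by relevance number.
    ranked_initial:    the originally retrieved documents, same ordering.
    """
    action = choose_action(len(initial), wanted)
    if action == "expand":
        # Too few documents: draw the top of the collection-wide ranking.
        return list(ranked_collection[:wanted])
    if action == "narrow":
        # Too many documents: keep the best of those already retrieved.
        return list(ranked_initial[:wanted])
    return list(ranked_initial)
```

Because the whole run is driven by the single number the user supplies
in advance, no intervention is needed between the initial search and the
final listing.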

In either of the aforementioned environments, the biggest problems
in implementing the proposed technique are the limitations imposed by
sorting the documents in relevance number sequence. A time-consuming
tape-sort operation is considered prohibitive, especially in an on-line
environment. It is not unrealistic to consider a core-sort for the
automatic narrowing procedure, since there generally exists a relatively
small, fixed number of document records being treated. However, an
algorithm must be derived to reduce the number of document records being
sorted before the automatic expansion technique can be handled in core.
If this problem can be alleviated, it appears likely the proposed
technique could perform within acceptable time limits in the CIS system.

Although the results presented in this paper are encouraging, in
the last analysis, the best way to evaluate the effectiveness of any
technique such as the one described is to observe its use in an
operational environment.

Abstract


This paper is concerned with the design and construction of the
retrieval component of a document retrieval system. A text processing
scheme is defined for syntactic and semantic reduction of full text.
A retrieval model is defined and constructed in such a way as to be
compatible with the document characterization process described. The
use of natural language communication is provided for the inquirer and
the system's browsability capability is described.

Introduction

The field of Information Storage and Retrieval is concerned with
the collection, organization and retrieval of recorded information, and
recently, the means by which these processes can be automated.

This paper will be concerned with the automation of the retrieval
component of a document retrieval system and, in particular, with the
description of a natural language man-machine communication scheme
which will provide a browsability feature for the user. The
feasibility of automating such a system will also be discussed.

General  Text  Processing  Scheme 

The theory of a natural language query scheme for an automated
document retrieval system requires that the system be compatible with
the text processing scheme used to describe or characterize the
document collection. Hillman [1] states that

"A theory of document retrieval is a deductive system
of the operations governing the retrieval of those
documents whose representations contain characteristics
(index terms) judged to be relevant to the
terms of a query."

To satisfy this requirement, it will be assumed that the document
collection has been characterized by the automatic syntactic text
processing system developed by Hillman and Reed [2]. An abridged
description of this text processing system will be given at this point
to provide the reader with some degree of familiarity with the system.

Document Characterization. A major hypothesis of the theory of text
processing is that the characteristics assigned to a document give some
indication of what it is that the document is about. This aboutness is
regarded as an a priori matter of logic, semantics and syntax.

In order to determine what a document is about, a scheme was devised
to identify the topic-denoting expressions occupying referential
position within the sentences of a document. In English, this
referential position is occupied by the noun phrase. Thus, the text
processing system examines the sentences of each document contained in
a document collection and identifies the noun phrases in each sentence
of the text. This process uses a computational scheme of syntactic
analysis of text developed by Hillman and Reed [3]. It is based on a
context-sensitive computational grammar which makes use of a limited
dictionary look-up. The dictionary consists of about three hundred
functor words and suffixes. Appendix I gives a listing of the items in
the dictionary. The analyzer assigns a syntactic category to each word
of the input document text and identifies nominal, prepositional and
infinitive phrases. The analyzer also segments the input sentences
into micro-sentences, that is, into syntactically simple sentences.
This initial step in text processing is called "microcategorization."
The next step in document characterization is a process termed
"macrocategorization." This is a method by which the topic-denoting
expressions and their predicates are identified. The process consists
of two steps, the first of which consolidates the microcategories into
larger units called "macrocategories." In the second step, the
topic-denoting expressions (potential document characteristics) and
their predicates are isolated. These topic-denoting expressions are the
keys to the documents since they reference the major topics about which
the documents make assertions. Appendix II gives some examples of input
and output of the microcategorization and macrocategorization processes
in this text processing system.

The final step in the processing of the text deals with assigning a
measure or weight to each document characteristic and with the process
of vocabulary control. After macrocategorization of the text has been
completed, the document characteristics are merged and sorted into
alphabetic order. Like characteristics are combined and counted.
Next, a measure of term-document connectivity is assigned to each
term-document pair. This is done using the notion of "lines of
connection" described by Hillman [4] and Goodman [5]. If the
predicates are thought of as relations and the document characteristics
as their arguments, then a characteristic term t will have n lines of
connection to a predicate P if P is an n-place predicate. Similarly, a
characteristic term t will have m lines of connection to a document D
if m is the sum of all lines of connection between t and the predicates
P of document D in which t appears as an argument. The latter is
clearly an assumption of linearity, since a characteristic term t will
have 8 lines of connection to document D if t appears as, for instance,
an argument of a four-termed predicate P_i, a three-termed predicate
P_j and a one-termed predicate P_k, all of which occur in document D.
A term-document matrix, called the affiliation matrix, is set up with
entries corresponding to the lines of connection between documents and
characteristics. This matrix is then multiplied by its transpose and
the resulting matrix is a term-term matrix called the incidence matrix.
Its entries establish connections between characteristics via some
document and are a measure of first-level connectivity for an n-termed
predicate. The incidence matrix is then partitioned into its
components, called transition matrices, and each is normalized. These
transition matrices represent distinct genera of terms whose members
are, hence, highly associated with each other. Finally, from each
transition matrix, a unique probability vector is extracted and
normalized. This vector consists of those characteristics occupying
the most central positions in a genus. This text processing scheme
produces for each document a set of characteristics which identify the
major topics referred to in the document. The weight of each
characteristic in a document provides a measure of its association with
the document, and a given characteristic will usually have different
weights relative to different documents. And no less important, the
genus structure provides a powerful tool to be used in retrieval of
documents from the collection. Appendix III provides an illustration
of this automated process.
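
As a rough illustration of the counting just described, the following
Python sketch builds the affiliation matrix from lines of connection and
multiplies it by its transpose to obtain the incidence matrix. The input
layout (each document as a list of predicates, each predicate as a tuple
of its argument terms) and the function names are assumptions made for
the sketch; a term repeated within one predicate is counted once, which
is also an assumption.

```python
from typing import Dict, List, Tuple

Predicate = Tuple[str, ...]  # the argument terms of one predicate


def affiliation(docs: Dict[str, List[Predicate]]) -> Dict[str, Dict[str, int]]:
    """Term-document matrix: lines of connection from each term to each document."""
    matrix: Dict[str, Dict[str, int]] = {}
    for doc_id, predicates in docs.items():
        for args in predicates:
            for term in set(args):
                row = matrix.setdefault(term, {})
                # A term has n lines of connection to an n-place predicate;
                # the lines to a document are the sum over its predicates.
                row[doc_id] = row.get(doc_id, 0) + len(args)
    return matrix


def incidence(aff: Dict[str, Dict[str, int]]) -> Dict[Tuple[str, str], int]:
    """Term-term matrix: the affiliation matrix times its transpose."""
    terms = sorted(aff)
    return {(s, t): sum(n * aff[t].get(d, 0) for d, n in aff[s].items())
            for s in terms for t in terms}


# The example from the text: t is an argument of a four-termed, a
# three-termed and a one-termed predicate, all occurring in document D.
docs = {"D": [("t", "u", "v", "w"), ("t", "x", "y"), ("t",)]}
aff = affiliation(docs)      # aff["t"]["D"] is 4 + 3 + 1
inc = incidence(aff)
```

Partitioning the incidence matrix into transition matrices and
extracting the probability vectors are not shown here.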

The Document Corpus. In order for a document retrieval system such as
the one being proposed here to be realistic, document collections
containing 100,000 documents or more would be appropriate. The measure
of connectivity between characteristics and documents would best reflect
the behavioral regularities inherent in such large collections.
However, it is not necessary, at least initially, to have such a large
document collection. A collection of documents such as that contained
in the Center for the Information Sciences (C.I.S.) collection would
certainly suffice. This is a fairly homogeneous non-static collection
of about 2,500 documents treating a wide variety of topics related to
Information Science and Retrieval.

STRUCTURE OF THE RETRIEVAL COMPONENT

A Formal Retrieval Model

Once  the  document  characterization  scheme  has  been  selected  for  the 
document  retrieval  system,  care  should  be  exercised  in  constructing  an 
appropriate  retrieval  component  for  the  system. 

In very general terms, the retrieval component of a document
retrieval system can be thought of as consisting of a document space D,
a retrieval prescription space P, and a mapping or transformation T
which transforms a prescription from P into a set of documents from D.


Figure  1.  A  Generalized  Retrieval  Component  Scheme 


The process of formally defining the retrieval component usually
consists of imposing some kind of mathematical structure on D and P and
then defining a transformation T in such a way that when the
transformation T is applied to a retrieval prescription it will produce
a relevant set of documents from D. However, the mathematical
structures imposed on the document space D and the prescription space P
are usually imposed without regard to the document characterization
component of the document retrieval system.

For example, if a document collection C is the basis of a document
retrieval system, the document space D can be thought of as consisting
of all possible subsets of documents formed from the collection C. Note
that if C contains n distinct documents, then the space D would consist
of 2^n distinct subsets of documents. Since the document space D is the
power set of a finite aggregate of documents, an obvious structure for
D immediately suggests itself. That is, the document space D is a
finite Boolean algebra. The prescription space P is often given a more
complicated structure that depends on the kind of index term to be used
in P. If R is a repertory of simple descriptors, Mooers [6] points out
that the descriptors can be thought of as two-element partially ordered
systems where either the descriptor A is asserted as providing a clue
to the document message, or no assertion, one way or the other, is made
about A. The space P is simply the cardinal product of the two-element
partially ordered systems in the repertory R. This space P of the
retrieval prescriptions based on simple descriptors is a Boolean
lattice. Each point x in P is a subset of descriptors from the
repertory R. Given any such point x in the space P, there is a large
family X of other points in P, each of which is "preceded" by the
point x. Considering now the space D, there are many documents whose
assigned subset of descriptors is one of the points belonging to the
family X. If x* is the largest set of documents which are indexed by a
subset of descriptors in X, then the transformation T of the point x in
P is the point x* in D. The transformation T is the basis of selection
in actual document retrieval systems based on descriptors.
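
A small Python sketch may make the formal model concrete. It assumes
that a point "preceded" by x is any descriptor set containing x, so that
T(x) collects every document whose assigned descriptors include all of
x; that reading, the sample index and the function names are
assumptions of the sketch.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set


def power_set(collection: Iterable[str]) -> List[FrozenSet[str]]:
    """The document space D: every subset of the collection C."""
    items = sorted(collection)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def transform(x: Iterable[str], index: Dict[str, Set[str]]) -> Set[str]:
    """T: map a prescription x (a set of descriptors) to the point x* in D."""
    wanted = frozenset(x)
    # A document qualifies when its descriptor set contains all of x.
    return {doc for doc, descriptors in index.items() if wanted <= descriptors}


# Hypothetical index: document -> assigned descriptors.
index = {"d1": {"A", "B"}, "d2": {"A"}, "d3": {"B", "C"}}
space = power_set(index)   # 2^3 = 8 subsets of a three-document collection
```

Under this reading the result of T is always one of the points of D,
which is why the formal structures are compatible even though they say
nothing about how the descriptors were assigned.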

It  is  quite  clear  that  the  mathematical  structures  imposed  on  the 
document  space  D  and  the  prescription  space  P  are  compatible  and  also 
that  the  transformation  T  is  well-defined.  However,  there  is  no  reason 
to  expect  that  the  abstract  mathematical  structures  imposed  on  D  and  P 
are,  in  fact,  the  real  structures  induced  in  D  and  P  by  the  document 
characterization  process.  Therefore,  the  transformation  T  cannot  be 
expected  to  be  very  efficient  in  the  actual  document  retrieval  process. 

The  Induced  Retrieval  Model 

From the discussion above, it is clear that a formal approach to
the construction of the retrieval component for a document retrieval
system is not very useful. Since a document characterization process
has been selected for the retrieval system, the obvious starting point
in determining an appropriate retrieval structure lies with the
document characterization process itself.

It seems quite reasonable to suspect that the structures induced in
the document collection by the document characterization scheme should
provide a clue to what features should be incorporated into the
retrieval component to make it compatible with the other system
components.

The document characterization scheme of Hillman [7] described
earlier in this paper utilizes both syntactic and semantic analysis of
full text, along with certain matrix operations and Markov processes to
isolate sets of highly connected document characteristics. As a
result, the document characterization process induces a partition of
the document space D into mutually disjoint sets of connected documents.

For  instance,  consider  the  result  of  the  document  characterization 
process  applied  to  the  document  collection  C  given  in  the  example  in 
Appendix  III. 


Figure  2.  The  Partition  and  Structure  of  the  Document  Space 
D  Induced  by  the  Document  Characterization  Process 
at  the  Document  Level. 


The structure of D at the document level clearly shows the
relationships that do exist between the various documents in this
trivial example, although nothing can be determined about the nature
and the relative strength of the connections between the related
documents. Thus, it appears that a more detailed picture of the
structure of the document space D is required if any insight is to be
gained about the structure of the appropriate retrieval model. This
can be done by simply adding the document characteristics to their
respective documents and modifying the picture in Figure 2 in the
obvious manner.


Figure 3. The Partition and Structure of the
Document Space D Induced by the
Document Characterization Process
at an Intermediate Structure Level


At this intermediate level, the picture becomes a little clearer.
It is now possible to define the distribution of the document
characteristics between documents in the document space D. Note that
for this particular example, the document U seems to tie the other
documents in the genus together, and if it were removed, the document
space D would be broken down into five distinct genera rather than the
three genera defined in Figure 3. The document characteristics a, h,
and j defining the document U are called articulation points of a
first-level genus.


Another point, one that has important relation to the construction
of the retrieval component, is that not every retrieval prescription in
the prescription space P will have its image defined in D. There are
several reasons for this situation. The most pronounced case occurs
when a prescription contains document characteristics belonging to two
or more genera. For instance, if the prescription were composed of the
document characteristics b, a, and h, then no document or set of
documents in D could satisfy the prescription. The result would be the
empty set of documents. The other situations that would result in the
empty document set are the obvious ones: foreign document
characteristics, non-existent documents (within the collection), etc.
In all of these situations, the retrieval system must be capable of
guiding an inquirer by providing relevant clues via inquirer-retrieval
system interaction in reformulating his original request. This will
necessarily make the response transformation T quite complicated; in
fact, it is very likely that no single transformation will suffice.

It is now quite apparent that the structure induced on the space D
by the document characterization process is much finer than the Boolean
algebra structure of P(C). The elements in the induced structure are,
of course, also elements in P(C). But the elements of the induced
structure in D are a very small "select" subset of the power set P(C).
What the mathematical structure will be for the induced structure is
difficult to say. Further investigation in this direction is being
presently carried on by the author. However, it is not necessary to
know what the mathematical structure is in order to construct the
retrieval component. The induced structure provides sufficient
information.

The document characterization process permits still another
refinement of the document space structure. It is possible to interpret
the incidence matrix (see Appendix III) as a graph G over the document
characteristic repertory R. By the graph G(R) is meant that a given
vertex set R of document characteristics forms a family of associations

A = (P_i, P_j)

where P_i, P_j ∈ R, indicating which vertices are connected
characteristics. Each association A is called an edge of the graph, and
the multiplicity of an edge, denoted by

ρ(P_i, P_j),

is simply the number of edges connecting the vertices P_i and P_j.
Since a document characteristic is connected with itself, there is
always an edge between any vertex and itself. Applying these notions to
the example of Appendix III we get the following view of D.


Figure  4.  The  Graphic  Structure  of  the 

Document  Space  D  Induced  by  the 
Document  Characterization  Process. 
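
The graph interpretation can be sketched in Python as follows. The
sketch assumes the genera correspond to the connected components of the
graph, represents the graph as a mapping from vertex pairs to edge
multiplicities, and uses invented characteristics and multiplicities
rather than the data of Appendix III; genera and genus_of are
hypothetical names.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


def genera(edges: Dict[Tuple[str, str], int]) -> List[Set[str]]:
    """Connected components of the characteristic graph G(R)."""
    adjacent: Dict[str, Set[str]] = {}
    for (p, q), multiplicity in edges.items():
        if multiplicity > 0:
            adjacent.setdefault(p, set()).add(q)
            adjacent.setdefault(q, set()).add(p)
    seen: Set[str] = set()
    components: List[Set[str]] = []
    for start in sorted(adjacent):
        if start in seen:
            continue
        # Depth-first walk collecting every vertex reachable from start.
        stack, component = [start], set()
        while stack:
            vertex = stack.pop()
            if vertex in component:
                continue
            component.add(vertex)
            stack.extend(adjacent[vertex] - component)
        seen |= component
        components.append(component)
    return components


def genus_of(prescription: Iterable[str],
             components: List[Set[str]]) -> Optional[Set[str]]:
    """The genus holding every prescribed characteristic, or None."""
    wanted = set(prescription)
    for component in components:
        if wanted <= component:
            return component
    return None


# Hypothetical edges with multiplicities; self-edges are always present.
edges = {("a", "a"): 1, ("a", "h"): 2, ("h", "j"): 1,
         ("b", "b"): 1, ("b", "c"): 3}
parts = genera(edges)
```

A prescription whose characteristics straddle two components, such as
b, a and h here, finds no single genus, which corresponds to the empty
document set discussed earlier.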


At this level of refinement, documents qua documents lose their
identity and the structures discernible in the space D are only those
structures defined on the document characteristics within the various
genera. In essence, this is the concept level of the document space D
and, in a sense, of the retrieval component itself. Intuitively, a
prescription is simply a set of characteristics selected by an inquirer
as defining, in his mind, a concept (or concepts) about which he is
interested. Actually the prescription defines what might be termed a
"proto-concept" since the inquirer might not know precisely what it is
that he wants. The selected characteristics then only provide the
retrieval system with a clue to the actual concept being sought.

This seems to suggest that the document space D is really a
subspace of the prescription space P. Since the inquirer's input
language is to be his own natural language, the retrieval system should
then be capable of inducing the same sort of structure in the
prescription space P as the document characterization process induces
in the document space D. Therefore, the retrieval component will
require the same sort of syntactic and semantic analysis scheme on the
natural language prescription to isolate the "proto-concept"
characteristics in the prescription as was employed in the document
characterization process.

The  final  step  in  the  construction  of  the  retrieval  component  is  to 
describe  the  retrieval  response  transformations.  They  will  be  a  family 
of  transformations  that  will  perform  the  following  tasks: 

1. Define the inquirer's prescription via direct inquirer-
retrieval system interaction. This is the process of
transforming the inquirer's proto-concept into a
concept defined in the retrieval system;

2.  Expand  or  narrow  the  retrieved  document  set; 

3.  Permit  browsability  within  either  documents  or  concept 
structures. 


Transformations which will effectively perform the retrieval
operations defined in (2) are, respectively, the double
pseudo-complement and the double Brouwerian complement operators as
described by Fairthorne [8]. They can be defined as follows:

Def.: The double pseudo-complement, A**, of a set A is
the smallest set that contains all documents indexed
by A, but not only by A.

Def.: The double Brouwerian complement, ¬¬A, of a set A
is the largest set of documents containing nothing
but A, but conceivably not all of the documents
containing A.

The transformation that will permit browsability is one which, in
effect, maps a characteristic into some genus at a given point and
produces those sentences in a particular document (or a set of
documents) in which the given characteristic occurs.
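
The sentence-producing half of the browsability transformation might
look like the following Python sketch. It assumes each document is
stored as a list of sentences and that a characteristic "occurs" in a
sentence when it matches as a whole word or phrase, ignoring case; the
mapping into a genus is not shown, and browse is a hypothetical name.

```python
import re
from typing import Dict, List


def browse(characteristic: str,
           documents: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return, per document, the sentences in which the characteristic occurs."""
    pattern = re.compile(r"\b" + re.escape(characteristic) + r"\b",
                         re.IGNORECASE)
    hits: Dict[str, List[str]] = {}
    for doc_id, sentences in documents.items():
        matched = [s for s in sentences if pattern.search(s)]
        if matched:  # omit documents with no occurrence
            hits[doc_id] = matched
    return hits


# Invented sentences for illustration only.
documents = {
    "d1": ["The thesaurus controls the indexing vocabulary.",
           "Searches run on the computer."],
    "d2": ["Index terms are posted weekly."],
}
shown = browse("thesaurus", documents)
```

Passing only the documents of one genus, or only those the inquirer has
asked to see, restricts the display in the obvious way.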

The process of defining the inquirer's prescription is a more
difficult one and it requires a high degree of inquirer-retrieval
system interaction. The inquirer's prescription is subjected initially
to a matching process in which a genus is located as being relevant to
at least one of the terms of the prescription. The retrieval system's
response is a set of connected terms from a genus that are closely
associated with at least one of the terms in the prescription. The
inquirer then responds with a reformulation of his original request and
the process is repeated. This interaction then isolates or redefines
the inquirer's prescription in terms compatible with the retrieval
system without necessarily modifying his proto-concept.

Now that the structure of the retrieval component has been
determined, all that remains is to describe its implementation.


IMPLEMENTATION  OF  THE  INDUCED  RETRIEVAL  MODEL 

In  this  section,  a  natural  language  retrieval  system  and  its  vari¬ 
ous  related  components  will  be  described. 

The  Retrieval  Scheme 

It  is  now  possible  to  describe  a  retrieval  scheme  that  will  permit 
natural  language  man-machine  communication  in  a  document  retrieval  sys¬ 
tem,  as  well  as  browsability  in  that  system. 

In a document retrieval system it must be agreed that, generally,
such a system should provide documents as output to a given inquiry.
However, the system will not be expected to provide facts as answers to
a query. Therefore, this document retrieval system will not accept as
valid any queries which are of an interrogative nature. That is, it
will reject all queries in the form of a question. The reason for this
restriction on the query structure is that questions usually require
facts as answers, and since this is a document retrieval system and not
a fact retrieval system, questions will be rejected as queries.

The browsability feature is made possible in this system as a
result of the text processing scheme described earlier. The
associations generated by this process permit association maps to be
constructed for the various genera defined by the document collection,
over which a query map is superimposed to isolate any relevant
documents. As a further convenience to the inquirer, the browsability
feature permits him to scan the text of any document which he feels is
interesting. The system displays various central topic sentences of
those documents requested by the inquirer. He can narrow or expand
this display simply by requesting that this be done.

A  valid  form  of  prescription  is  any  sentence  describing  the  topics 
which  the  inquirer  is  interested  in.  For  example,  a  valid  prescription 
could  be  the  following: 

"I am interested in learning something about lattice
structures and their relation to document retrieval
systems. I am particularly interested in lattice
structures and their relation to the retrieval
component of a document retrieval system."

This prescription is, for all practical purposes, a demand made by the
inquirer on the retrieval system to produce any documents which are
about lattice structures and document retrieval systems or, more
precisely, about lattice structures and the retrieval component of a
document retrieval system. The result of the syntactic analysis of the
retrieval prescription would be just those noun phrases singled out in
the discussion above. Since each sentence in the prescription would be
treated as a distinct document, a connectivity structure for the
prescription would be induced by the prescription (document)
characterization process when applied to the prescription. The
prescription characteristics along with their respective connectivity
measures are given below.


prescription characteristic                 connectivity measure

lattice structures                                 16/32
document retrieval systems                          8/32
retrieval component of a
  document retrieval system                         8/32

This induced structuring of the prescription permits a definite
ordering or ranking of the output documents as implied by the inquirer
in his prescription.

There are of course other methods of deriving "index terms" from
the prescription. For example, Luhn's [9] KWIC indexing scheme could
be applied to select the keywords in the prescription. However, there
are two important disadvantages in using the KWIC indexing scheme. In
the first place, the KWIC index terms probably would not be compatible
with the system's document characteristics. Secondly, no implicit
ranking scheme can be derived from the prescription for use in output
ranking of retrieved documents. Thus the syntactic analysis approach
[10], at least for this retrieval system, seems most appropriate.

The output from the syntactic analysis of the query becomes the
search request for the first-level document search. First, an attempt
is made to determine if the query characteristics are in the document
characteristic table. The matching scheme used here should have the
list processing capabilities inherent in either the LISP [11] or
SNOBOL [12] compiler languages. The reason for such a capability is
to permit the system to match single portions of the query
characteristic with the document characteristics. This is necessary
since a query characteristic would rarely have the same word
composition as the document characteristic. The matching scheme
isolates the various genera connected with the query characteristics.
Once this is known, the system determines the most closely related
terms in the respective genus and displays them to the inquirer. This
affords the inquirer the capability of selecting actual document
characteristics which are relevant to his request. The document
characteristics which the inquirer feels are most relevant to his need
are then re-entered into the system to be used in document selection.
If the query characteristics did happen to match directly with various
document characteristics, the system would proceed to select relevant
documents.
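
The partial matching of query characteristics against document
characteristics can be sketched as follows. This is only an
illustration, written in Python rather than LISP or SNOBOL; the
word-overlap rule, the small list of trivial words, and every name in
it are assumptions made for the example, not part of the system as
described.

```python
# Hypothetical list of trivial words; in the system described, the
# functor dictionary would presumably play this role.
TRIVIAL = {"a", "an", "the", "of", "and", "in", "to", "for"}


def content_words(phrase):
    """Return the set of non-trivial, lower-cased words in a phrase."""
    return {w for w in phrase.lower().split() if w not in TRIVIAL}


def match_characteristics(query_char, document_chars):
    """Rank document characteristics by the number of words shared with the query.

    Characteristics sharing no word with the query are dropped.
    """
    query_words = content_words(query_char)
    scored = []
    for doc_char in document_chars:
        shared = len(query_words & content_words(doc_char))
        if shared:
            scored.append((shared, doc_char))
    # Most shared words first; ties broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_char for _, doc_char in scored]


# Hypothetical document characteristic table.
table = ["lattice structures", "document retrieval systems", "syntactic analysis"]
matches = match_characteristics(
    "retrieval component of a document retrieval system", table
)
```

Here the query characteristic shares the words "document" and
"retrieval" with one table entry, so only that entry would be offered
to the inquirer for selection.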

At this point, the system determines the relevant documents first
by scanning the Inverted File, which lists those documents referenced
by the particular query characteristic (actually this is a document
characteristic). These documents are then selected from the Serial
File and ordered in decreasing order of relevance to the query. This
can easily be done since, in the Serial File, each referencing
characteristic is listed with its associated connectivity value
relative to that document. The document having the most relevant
query characteristics and hence the highest connectivity value would be
selected as most relevant to the query. Ordering of the total
connectivity values will, in effect, order the retrieved documents.
The retrieved documents are then displayed to the inquirer for
approval. Here author-title information is displayed in place of the
actual document. The inquirer can then decide which documents interest
him, if any at all. If too many documents were retrieved, the inquirer
can narrow his search by re-selecting more characteristics from the
relevant genus. If too few documents were selected, the inquirer can
expand his search by deleting less important characteristics. In
either case, the retrieval operations described above are repeated.
Thus, the document retrieval system is quite flexible in its ability to
accept queries, which are, in a sense, simple association maps, and
superimpose them on the detailed association maps defined within each
genus.
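
The selection and ordering step can be illustrated with a short
sketch. It assumes that the Serial File can be modelled as a mapping
from document number to characteristics with connectivity values, and
that a document's relevance is the sum of the connectivity values of
the query characteristics it carries. The data and names are
hypothetical.

```python
from collections import defaultdict

# Hypothetical serial file: document number -> {characteristic: connectivity value}.
serial_file = {
    101: {"lattice structures": 0.50, "document retrieval systems": 0.25},
    102: {"document retrieval systems": 0.40},
    103: {"lattice structures": 0.10, "syntactic analysis": 0.60},
}


def build_inverted_file(serial):
    """Invert the serial file: characteristic -> set of document numbers."""
    inverted = defaultdict(set)
    for doc_no, characteristics in serial.items():
        for characteristic in characteristics:
            inverted[characteristic].add(doc_no)
    return inverted


def retrieve(query_chars, serial, inverted):
    """Return (document number, total connectivity) pairs, most relevant first."""
    totals = defaultdict(float)
    for characteristic in query_chars:
        for doc_no in inverted.get(characteristic, ()):
            totals[doc_no] += serial[doc_no][characteristic]
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


inverted_file = build_inverted_file(serial_file)
ranked = retrieve(
    ["lattice structures", "document retrieval systems"],
    serial_file,
    inverted_file,
)
```

With this data, document 101 is ranked first because it carries both
query characteristics; in the system described, its author-title
information would be displayed before that of documents 102 and 103.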

The  retrieval  system  also  provides  the  inquirer  with  the  ability  to 
browse  through  actual  document  text  if  he  so  desires.  When  the  inquirer 
has  found  a  document  (or  documents)  which  is  of  particular  interest  to 
him,  he  may  ask  the  system  to  display  for  him  those  passages  of  text 
which  are  most  central  to  the  document.  This  is  possible  since  the 
document  characteristics  are  actual  phrases  appearing  in  the  text  and 
are  central  topics  treated  in  the  document.  Therefore,  the  inquirer 
can  simply  ask  the  system  to  display  those  passages  in  a  particular 
document  by  listing  the  document  characteristics  which  are  most  appro¬ 
priate  to  his  needs.  The  system  will  display  these  passages  from  the 
document  text  file,  which  consists  of  such  passages  rather  than  full 
text. 

File  Organization 

The system will require several kinds of file structures designed
to satisfy its various kinds of data files. These will now be described
in some detail.

Document File. The C.I.S. document collection will be used as the
document data base. These documents will be in machine-readable form
and will reside on magnetic tapes as an off-line component of the file.

Serial File. The serial file will be a disk-oriented file composed of
variable length unblocked records having the following format:

+------+------------------------+--------------+
| Doc. | Doc. Characteristics   | Author-Title |
| No.  | and their respective   | Information  |
|      | Connectivity Values    |              |
+------+------------------------+--------------+

The variable length unblocked record scheme was chosen because it
provides the most economic use of the disk. A record key is generated
to describe the record structure used in each record and is stored as
part of that record. This structuring depends on the disk storage
device being used by the system.

The  document  number  will  be  a  disk  address  that  is  derived  from  the 
record  key  assigned  to  the  block  structure.  A  mathematical  transforma¬ 
tion  or  algorithm  generates  a  numerical  address  at  which  the  record  is 
stored.  No  index  is  required  to  determine  the  location  of  any  record 
in  the  file.  Generally,  however,  extensive  analysis  of  the  record  key 
structure  and  range  is  necessary  to  implement  such  a  randomly  organized 
file.  The  ideal  routine  for  record  key  conversion  produces  a  unique 
storage  address  for  every  record  in  a  file.  In  this  case  a  unique 
storage  address  is  possible  since  there  are  only  2,500  such  records. 

A  simple  conversion  method  that  is  flexible  and  easy  to  implement  is 
the divide-remainder technique, whereby a record key is divided by a
number  selected  to  produce  a  quotient  and  a  remainder.  The  quotient  is 
discarded,  and  the  remainder  becomes  the  disk  record  storage  address. 
The  divisor  should  be  either  a  prime  number  or  a  number  ending  in  1,  3, 
7,  or  9.  This  usually  results  in  relatively  few  duplicate  remainders, 
and  thus  relatively  few  address  synonyms.  The  logical  choice  for  a 
divisor  is  a  prime  number  slightly  less  than  the  number  of  storage 
units  allocated  for  the  given  data  file.  Tentative  choices  for  a 
divisor  may  be  tested  in  the  key  transformation  routine  to  determine 
which  produces  the  fewest  synonyms. 
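
The divide-remainder technique and the testing of tentative divisors
can be sketched as follows. The number of storage units, the record
keys, and all function names are hypothetical and serve only to
illustrate the procedure.

```python
def is_prime(n):
    """Trial-division primality test; adequate for small divisors."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def largest_prime_below(n):
    """Largest prime strictly less than n (n must exceed 2)."""
    candidate = n - 1
    while not is_prime(candidate):
        candidate -= 1
    return candidate


def address(record_key, divisor):
    """Divide-remainder transformation: discard the quotient, keep the remainder."""
    return record_key % divisor


def count_synonyms(keys, divisor):
    """Count keys whose address is already taken by an earlier key."""
    seen = set()
    synonyms = 0
    for key in keys:
        a = address(key, divisor)
        if a in seen:
            synonyms += 1
        else:
            seen.add(a)
    return synonyms


def best_divisor(keys, candidates):
    """Pick the tentative divisor that produces the fewest synonyms."""
    return min(candidates, key=lambda d: (count_synonyms(keys, d), d))


STORAGE_UNITS = 3000                                # hypothetical allocation
keys = [10007 + 13 * i for i in range(2500)]        # hypothetical record keys
prime_divisor = largest_prime_below(STORAGE_UNITS)  # 2999
chosen = best_divisor(keys, [2600, prime_divisor])
```

For these keys the prime divisor 2999 produces no synonyms, whereas
2600, which shares the factor 13 with the key spacing, maps the 2,500
keys onto only 200 distinct addresses.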

The second part of the document record (serial file) consists of
document characteristic numbers along with their respective
connectivity values for the given document. Each document will be
allowed up to 10 document characteristics, each characteristic
consisting of an average of 3 English words.

The third section of the document record is made up of author-title
information to be used in retrieval returns in place of document
numbers.

Inverted File. This file is a disk-oriented term-document file in
which each record consists of a term number and the affiliated document
numbers of those documents characterized by the given term. The
characteristics are organized into genera which were defined during the
initial text processing of the document collection. In this case, the
file is in fixed-length blocked record format. A key provides the
location of each genus in the file for fast access to these segments of
the file. The inverted file will probably consist of around 5,000
document characteristics.

The Dictionary. The dictionary consists of the functors and suffixes
that are used by the syntactic analyzer. It will also reside on the
disk in a fixed-length blocked record format.

Systems Programs. All of the systems programs will reside on the disk
and will be called by a program monitor that will reside in core memory.
The programs are those required by the document retrieval system, such
as the syntactic analyzer, general retrieval programs, browsing
programs, and system update programs. The following diagrams display
the flow of the retrieval operations and the various options which are
possible.


SUMMARY

In the foregoing discussion, an attempt was made to construct a
document retrieval system capable of natural language communication via
man-machine interaction. The document base selected was the C.I.S.
collection of documents related to Information Science. The documents
were syntactically analyzed to generate highly associated document
characteristics which were used to index the documents. The required
data files were also described and algorithms for assigning disk
addresses to text records were proposed. After the preliminaries of
text processing were given, the actual document retrieval operations
were defined and the various retrieval options including browsability
were outlined.

At this point, I would like to make a few brief comments regarding
the feasibility of such a document retrieval system. In terms of
hardware, the system would require at least an on-line processing
capability with remote terminal access. A time-shared computer system
such as an IBM-360/67 or a GE-645, or in fact any comparable processing
system, would suffice. A main core memory of no less than 8K words
would be necessary for operating programs and for data space. As
secondary storage, a large capacity disk storage unit would best
satisfy the system's needs. For example, if we consider an IBM-2302/4
disk storage unit, which has a data capacity of 224,280,000 bytes (or
alphanumeric characters), we could get some idea of how much space the
retrieval system would require.


File                     Millions of Bytes     No. of Disks

Document Text File             130                 35
Central Topic File              25                  6
Serial File                      2                   .5
Inverted File                     .1                 .025
Term Table                        .25                .05
System Programs                   .5                 .1

The estimates for the number of bytes in each file are based on an
assumed average of six characters per English word. For example, if
it is assumed that an average document in the C.I.S. file contains
10,000 words and there are 2,100 documents, then the total number of
characters (or bytes) in this document collection would be about 126
million.
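
The arithmetic behind the document text estimate can be checked with a
few lines. The figures of 10,000 words per document and 2,100
documents are taken here as the reading that reproduces the stated
total of about 126 million characters; the variable names are
illustrative only.

```python
CHARS_PER_WORD = 6             # assumed average, as stated in the text
WORDS_PER_DOCUMENT = 10_000    # assumed average document length
DOCUMENTS = 2_100              # assumed number of documents
UNIT_CAPACITY = 224_280_000    # bytes in one IBM-2302/4 unit, as quoted above

# Total characters (bytes) of document text.
total_bytes = CHARS_PER_WORD * WORDS_PER_DOCUMENT * DOCUMENTS

# Share of one storage unit taken up by the document text alone.
fraction_of_unit = total_bytes / UNIT_CAPACITY
```

Under these assumptions the document text comes to 126,000,000 bytes,
a little over half the capacity of a single storage unit of the kind
considered.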

Although it is apparent that the hardware exists and is available,
this by no means implies that such a retrieval system could be
implemented economically on such apparatus. However, it does seem
reasonable to expect that the average cost per query would not be
prohibitively high, since the amount of processing time for each
request would be low as a result of the very fast processors in
time-shared systems.


Dictionary Program and Listings

Input to the dictionary program is presently by cards. Periods are
used only to mark the end of sentences and are punched as separate
words. Commas are also punched as separate words.

The program first compares each word of text with the functor word
dictionary listed in Table A below. If a match is found, the text word
is assigned the category listed for it in the dictionary. If no match
is found, the program then compares the text word with the two suffix
dictionaries.

For a suffix search, all final s's are deleted from the text word.
This was done to shorten the suffix dictionaries and to facilitate
programming. In the first suffix search the last two letters of the
text word (minus s's) are compared with the two letter suffixes listed
in Table B below. If no match is found, then the last three letters of
the text word are compared with the three letter suffixes listed in
Table C below. If a match is found in either dictionary, the
corresponding category is assigned to the text word. If no match is
found in any of the dictionaries, the word is assigned the type 'U' for
unknown.
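
The lookup order just described can be sketched as follows. The three
small dictionaries are abbreviated excerpts chosen for illustration,
the unknown type is written "U" as an assumed reading, and the function
name is hypothetical.

```python
# Abbreviated, illustrative excerpts of the three dictionaries.
FUNCTORS = {"the": "ART", "of": "P", "is": "AV", "not": "B"}
TWO_LETTER = {"CY": "N", "ED": "VD", "ER": "N", "IC": "A", "LY": "B"}
THREE_LETTER = {"AGE": "N", "ATE": "V", "BLE": "A", "ING": "VP"}
UNKNOWN = "U"  # assumed symbol for the unknown type


def categorize(word):
    """Assign a category: functor lookup, then 2-letter, then 3-letter suffix."""
    if word.lower() in FUNCTORS:
        return FUNCTORS[word.lower()]
    # All final s's are deleted before the suffix searches.
    stem = word.upper().rstrip("S")
    if stem[-2:] in TWO_LETTER:
        return TWO_LETTER[stem[-2:]]
    if stem[-3:] in THREE_LETTER:
        return THREE_LETTER[stem[-3:]]
    return UNKNOWN


tagged = [(w, categorize(w)) for w in ["The", "indexed", "storages", "retrieval"]]
```

In this sketch "indexed" is caught by the two letter suffix ED,
"storages" loses its final s and is caught by the three letter suffix
AGE, and "retrieval" matches nothing and is left unknown.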

The text words followed by their categories form the output on
tape. A period is assigned a category of its own, which functions as
an end-of-sentence tag for the other programs.

Table A

Functor Word Dictionary (in machine alphabetic order, reading across)

Word         Category   Word         Category   Word         Category

About        P          Above        P          Across       P
Adequate     A          After        P          Again        B
Against      P          All          A          Along        P
Also         B          Always       B          Among        P
Am           AV         And          C1         An           ART
Anybody      N          Anyone       N          Any          A
Anything     N          Apart        P          Apply        V
Aren't       AV         Are          AV         Around       P
A            ART        As           C3         At           P
Away         B          Back         P          Because      C
Been         AV         Before       P          Behind       P
Being        AV         Below        P          Be           AV
Beside       P          Between      P          Beyond       P
Both         C2         But          C          By           P
Can't        AV         Cannot       AV         Can          AV
Couldn't     AV         Could        AV         Didn't       AV
Did          AV         Doesn't      AV         Does         AV
Don't        AV         Do           AV         Each         A
Either       C          Else         N          Everybody    N
Everyone     N          Everything   N          Except       P
Fail         V          Fails        V          Fewer        A
Fewest       A          Few          A          First        B
For          P          From         P          Hadn't       AV
Had          AV         Hasn't       AV         Has          AV
Haven't      AV         Have         AV         Having       AV
Hence        B          Her          N          Here         B
Hers         A          Herself      N          He           N
Him          N          Himself      N          His          A
However      B          How          B          If           C
Including    C1         In           P          Inside       P
Into         P          I            N          Isn't        AV
Is           AV         It           N          Its          X
Itself       N          Many         A          May          AV
Mention      V          Me           N          Might        AV
More         A          Mustn't      AV         Must         AV
My           A          Myself       N          Near         P
Need         N          Needs        N          Nobody       N
None         N          Nor          C          No           A
Not          B          Now          B          Of           P
Only         B          On           P          Or           C1
Other        A          Others       N          Oughtn't     AV
Ought        AV         Our          A          Ours         A
Out          B          Outside      P          Over         P
Own          A          Per          P          Rate         N
Rates        N          Rather       C          She          N
Shouldn't    AV         Should       AV         Since        C
Somebody     N          Someone      N          Some         N
So           B          Something    N          Sometimes    Vl
Somewhat     A          Still        B          Such         A
Than         C2         That         T          Their        A
Them         N          Then         C          Therefore    B
There        N          The          ART        These        N
They         N          This         N          Those        N
Too          B          To           P          Toward       P
Under        P          Until        K          Upon         P
Up           P          Us           N          Very         B
Via          P          Wasn't       AV         Was          AV
Weren't      AV         Were         AV         We           N
When         C          Where        C          Which        K
While        C          Whoever      N          Whom         K
Who          K          Whose        K          Why          B
Will         AV         Within       P          Without      P
With         P          Wouldn't     AV         Would        AV
Your         A          Yours        A          Yourself     N
You          N          [illegible]  C1


Table B

Two Letter Suffix Dictionary

Suffix       Category

CY           N
ED           VD
ER           N
IC           A
LY           B
UM           13
US           N


Table C

Three Letter Suffix Dictionary

Suffix       Category   Suffix       Category

AGE          N          ION
AIN          V          ISE
ANT          N          ISH
APH          N          ISM
ARD          A          IST
ARY          A          ITE
ATE          V          ITY
BLE          A          IVE
CIE          N          IZE
CUR          V          LAR
DOM          N          LIE
ECT          V          LOG
EDE          V          MU
EED          V          NAR
ENT          A          NCE
ERY          N          OGY
EST          A          OID
ETH          N          OLY
ETY          N          ORY
EVE          V          OSE
FIE          V          OUS
FUL          A          PEL
GIE          N          RAM
GUE          N          RIE
HER          N          SAL
HIP          N          SCE
IAN          N          SHE
IAR          A          THE
IBE          V          TIE
IER          N          TOR
IFY          V          TRY
ILE          N          UCE
INE          N          UDE
ING          VP         URE


APPENDIX II

The following output is an example of the form of output produced
by the syntactic analyzer used in text processing.

The following, together with their transformations, are examples of
the connectivity assignment procedures that are used in the text
analysis operations.

To illustrate the document characterization process, let
C = {S,U,V,W,X,Y,Z} be the document collection and suppose that
P = {a,b,c,d,e,f,g,h,i,j,k,l,m,n} is the repertory of document
characteristics for the document collection C. Suppose also that the
documents in C have the following relational structure:

S: {R1(c), R2(d), R3(g,j)}
U: {R4(a,h), R5(j)}
V: {R6(h,n), R7(h,l), R8(n)}
W: {R9(a), R10(k)}
X: {R11(i,l,n), R12(l,n)}
Y: {R13(b), R14(f)}
Z: {R15(e), R16(e,m)}

The affiliation matrix, the incidence matrix, and the submatrices
resulting from the partitioning of the incidence matrix along with
their respective unique probability vectors are shown on the following
three pages.

Abstract


In Part I, a manual indexing system, using phrases rather than
uniterms or descriptors, is developed and evaluated in terms of certain
assumptions about user-oriented systems. Part II deals with retrieval
operations for manually and automatically phrase-indexed systems.


PART I. Introduction


The major guiding principle in the conception and design of this IR
system is to produce a user-oriented system. It is assumed that in a
user-oriented system the user should not be expected to know about the
technicalities of the system, that he should not be forced to express
his request in a very restricted and structured vocabulary and form,
and that his interface with the system should be in normally meaningful
natural language.

To  satisfy  these  assumptions  indexing  of  documents  in  the  system  is 
done  by  assigning  short  descriptive  phrases  to  documents.  Phrases 
rather  than  uniterms  or  descriptors  are  used  because  uniterms  and 
descriptors  by  themselves  have  little  definite  meaning  to  someone  not 
well  acquainted  with  the  particular  system  at  hand.  It  is  felt  that 
phrases  provide  contextual  meaning  for  terms  embedded  in  them  and  that 
this  contextual  meaning  of  terms  will  provide  for  meaningful  interface 
between  the  system  and  the  user. 

The generation of these descriptive phrases by the indexer is
governed by a semi-controlled vocabulary, possibly a few syntactic
constraints, and an acceptability check by the system. The indexer is
given a basic word vocabulary consisting of the general vocabulary of
the field of the documents. The indexer has the liberty to augment the
basic word vocabulary with other words for the purposes of additional
qualification; e.g., in the phrase "Goodman's concept of relevance,"
"concept" and "relevance" would be basic vocabulary and "Goodman"
augmented vocabulary. The system is so designed that augmented
vocabulary will not have adverse effects on the system's performance.

There  are  no  theoretical  restrictions  on  the  number  of  phrases 
that  can  be  assigned  to  a  document.  Because  of  this  and  other  factors 
discussed  below,  this  system  has  parallels  with  Hillman’s  document 
characterization system [1]. Obviously, the more phrases assigned per
document, the more detailed the indexing is.

Syntactic constraints on indexing phrases and the indexing
acceptability check by the system are discussed below.

The indexing phrases are interpreted by the system as graph
connected word strings. Each non-trivial word in an indexing phrase is
taken to be a graph node. A phrase is represented as a connected path
in linear order between the words in the phrase. The graph is then
represented in terms of a matrix.
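
A minimal sketch of this interpretation is given below. It assumes
that the path is undirected, so that the matrix is symmetric, and that
a short list of trivial words is simply skipped; both choices, like the
names used, are assumptions of the example.

```python
# Hypothetical list of trivial words that do not become graph nodes.
TRIVIAL = {"of", "the", "a", "an", "in", "and", "to", "for"}


def phrase_nodes(phrase):
    """Non-trivial words of a phrase, in their linear order."""
    return [w for w in phrase.lower().split() if w not in TRIVIAL]


def word_word_matrix(phrases):
    """Build the vocabulary and a symmetric 0/1 adjacency matrix.

    Each phrase contributes a path: consecutive non-trivial words are linked.
    """
    vocabulary = sorted({w for p in phrases for w in phrase_nodes(p)})
    index = {w: i for i, w in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in vocabulary]
    for phrase in phrases:
        nodes = phrase_nodes(phrase)
        for left, right in zip(nodes, nodes[1:]):
            i, j = index[left], index[right]
            matrix[i][j] = matrix[j][i] = 1
    return vocabulary, matrix


vocabulary, matrix = word_word_matrix(
    ["concept of relevance", "relevance of documents"]
)
```

The two phrases share the node "relevance", so in the resulting matrix
"concept" and "documents" are each linked to "relevance" but not to one
another.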


The  structuring  of  indexing  phrases  in  terms  of  graph  theoretic 
concepts  was  chosen  on  the  basis  of  the  following  considerations: 


1.  It  is  desirable  to  structure  the  indexing  phrases  in  a 
form  which  captures  the  maximal  structure  of  the  phrases 
themselves  because  it  is  assumed  to  be  desirable  not  to 
impose  any  more  structure  upon  them  than  they  actually 
possess.  It  is  felt  that  by  structuring  the  phrases  in 
the  most  general  form  they  possess  the  system  does  not 
make  a  priori  commitments  to  any  particular  theoretical 
structure the phrases might be thought to have. Since
the system does not incorporate at the grass roots level
any structural model except a most general weak one, one
is left free in the retrieval processes to impose upon
the phrases a wide choice of structural models. Retrieval
processes in the system thus can manipulate the
phrases  as  is  pragmatically  useful  because  of  the  lack  of 
any  strong  structuring  in  the  storage  of  the  indexing 
phrases.  Since  the  phrases  are  interpreted  in  a  way  that 
captures  only  their  most  general  structure,  the  structural 
assumptions  of  any  particular  retrieval  process  remain 
explicitly  clear  and  unconfused  with  the  structure  of  the 
indexing  phrases. 

2.  It  is  believed  that  a  graph  interpretation  of  the  index¬ 
ing  phrases  provides  all  the  structure  necessary  for  a 
retrieval  system  to  satisfy  the  assumptions  made  at  the 
beginning  of  this  paper  about  user-oriented  systems. 

3. This technique of structuring the indexing phrases
superimposes a word-word matrix on a phrase-phrase matrix
similar to the term-term matrix generated by Hillman's
technique [2]. Hillman's process generates a
phrase-phrase matrix based on phrase co-occurrence and a
weighting factor. All phrases used to characterize a
particular document, with manual indexing as well as in
Hillman's system of document characterization, are bound
in the same genus. All the phrases of other documents
which have at least one phrase in common with this one
document are also bound in the same genus. Although this
process groups together conceptually related documents by
grouping together their characteristics, it does not
provide for user-oriented retrieval techniques, since
there is little likelihood that a user will give in his
request an exact phrase which was used to characterize a
document. However, a word-word matrix superimposed on the
phrase-phrase matrix allows retrieval techniques which
lead the user heuristically to the manually produced
phrases or the source-oriented phrases of Hillman's
system which are appropriate to the user's request
statements.
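
The binding of phrases into genera described in item 3 can be read as
finding the connected groups of phrases that are linked through shared
documents. The sketch below follows that reading; treating the binding
as transitive is an assumption, and the single-letter phrases and
document labels are hypothetical.

```python
def find(parent, x):
    """Return the representative of x's group, shortening the path as it goes."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def genera(doc_phrases):
    """Group phrases so that all phrases of a document share one genus."""
    parent = {}
    for phrases in doc_phrases.values():
        for p in phrases:
            parent.setdefault(p, p)
        # Bind every phrase of the document to the document's first phrase.
        for p in phrases[1:]:
            root_a, root_b = find(parent, phrases[0]), find(parent, p)
            if root_a != root_b:
                parent[root_b] = root_a
    groups = {}
    for p in parent:
        groups.setdefault(find(parent, p), set()).add(p)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical collection: document label -> characterizing phrases.
collection = {
    "S": ["c", "d", "g", "j"],
    "U": ["a", "h", "j"],
    "V": ["h", "l", "n"],
    "W": ["a", "k"],
    "X": ["i", "l", "n"],
    "Y": ["b", "f"],
    "Z": ["e", "m"],
}
result = genera(collection)
```

In this collection documents S, U, V, W, and X are chained together by
shared phrases and so fall into one genus, while Y and Z each form a
genus of their own.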

Part II of this paper develops this system in detail. It deals
explicitly with retrieval operation processes for a document collection
characterized by Hillman's methods. However, in this part of the paper
I have dealt with a retrieval system in which document
characterizations are manually produced. Given documents manually
indexed with descriptive phrases, a term-term (i.e., phrase-phrase)
matrix and a term (phrase)-document matrix can be generated, and these
matrices form the input to that part of the system described in Part
II. I believe that everything in Part II is applicable to manual
indexing systems of the type discussed above as well as to Hillman's
system. In addition to the considerations in Part II there are two
other facets of such manual indexing systems which are particularly
significant for system design and evaluation.

Since  in  manual  systems  the  syntax  of  the  indexing  descriptive 
phrases  is  more  controllable  than  in  automatic  systems,  the  utility  of 
restrictive  or  special  meaning  phrase  syntax  can  be  more  fully 
examined. (See Part II.)

In  a  manual  indexing  system  the  retrieval  operation  procedures  can 
provide  an  acceptability  check  on  the  indexing.  This  acceptability 
check  can  evaluate  the  effect  of  new  phrases  on  the  structure  of  the 
term-term  matrix  and  the  word-word  matrix.  If  a  set  of  new  indexing 
phrases links previously distinct genera (partitions) of the term-term
matrix,  the  indexer,  using  the  browsing  operations  described  in  Part 
II,  can  determine  if  the  new  documents  as  indexed  really  relate  the 
two  different  conceptual  areas  of  the  distinct  genera  and/or  if  it  is 
desirable  from  a  subject  matter  point  of  view  to  form  the  two  genera 
into  one  genus.  This  check  thus  allows  manual  structuring  of  the  con¬ 
ceptual  areas  defined  by  genera. 
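
One simple form such an acceptability check could take is sketched
below: the phrases of a newly indexed document are looked up in the
existing genus assignment, and the indexer is alerted when more than
one genus is touched. The genus labels, phrases, and function names are
hypothetical.

```python
def genera_linked_by(new_phrases, genus_of):
    """Existing genera touched by the phrases of a newly indexed document."""
    return {genus_of[p] for p in new_phrases if p in genus_of}


def needs_review(new_phrases, genus_of):
    """True when the new document would link previously distinct genera."""
    return len(genera_linked_by(new_phrases, genus_of)) > 1


# Hypothetical existing assignment of phrases to genera.
genus_of = {
    "lattice structures": 1,
    "document retrieval systems": 1,
    "syntactic analysis": 2,
}

linking = ["lattice structures", "syntactic analysis", "a new phrase"]
harmless = ["lattice structures", "a new phrase"]
flags = (needs_review(linking, genus_of), needs_review(harmless, genus_of))
```

Only the first document would be referred to the indexer, who could
then browse the two genera and decide whether they should be formed
into one.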

Operations with the word-word matrix can present the indexer with
a resume of the contextual meaning previously given to the indexing
vocabulary. The indexer thus can determine if a set of new indexing
phrases employs vocabulary consistent with past use of vocabulary. If
it is found desirable to modify the meaning of vocabulary words, the
system can readily present the past usage of selected words for
appropriate decisions. Such modifications in vocabulary usage will, of
course, require the regeneration of the files used for retrieval
operations but no change in retrieval operations. Additions to the
indexing vocabulary can be checked for subject matter consistency by
examination of the genera in which the system places the phrases
containing the new vocabulary.


PART  II. 


A. Preliminaries

The text processing and matrix operations described above yield a
document-term matrix and a partitioned term-term matrix which together
characterize a document collection. To utilize this characterization
fully for document retrieval it is necessary to develop retrieval
techniques which make optimal use of all of the information about the
system's documents contained in these matrices. Since there is little
likelihood that a user will employ exactly the same terms in requests
as the document characterization procedure or the indexer selects to
characterize documents, the problem of interface between user and
system is of central importance for making optimal use of the matrices.
To make minimal demands upon the user, a word-word matrix, for each
non-trivial word occurring in document characterizations, is generated
and is used in retrieval operations to lead the user heuristically from
the terminology of his request statements to the source-derived or
indexer-supplied terminology of the document characterization.

It is believed that this technique, as discussed in the next two
sections, will successfully allow useful and meaningful interface between
the user and the information the system contains about its documents
and will allow the user to make full use of document characterization
associations in an intuitively intelligent manner.

An  attempt  has  been  made  in  the  following  discussion  to  delineate 
the  generation  and  structure  of  data  files  and  the  retrieval  operation 
procedures  in  a  fashion  which  is  easily  programmed.  Flowcharts  of 
proposed  programs  for  these  are  below.  The  system  configurations 
needed  are  indicated  on  the  flowcharts. 

B.  Generation  of  Files  for  Retrieval  Purposes 

Input to these programs consists of the partitioned term-term
matrices and the document-term matrix described above. The terms of
the matrices are source derived phrases. Each phrase is assigned a
unique code number which is called a "String Number." A double entry
dictionary is constructed containing string numbers and their correlate
phrases. Following each phrase is its entry from the document-term
matrix, which contains the documents the phrase characterizes and its
width in each of these documents. This dictionary is called the "String
Number-Phrase Dictionary."
Figure 1. String Number-Phrase Dictionary
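
As an illustration only, the double entry dictionary might be built along the following lines. The function name and the in-memory layout are assumptions of this sketch; the document-term matrix is assumed to be available as a mapping from each phrase to its documents and widths, and String Numbers are assumed to be assigned in alphabetical order of phrase.

```python
def build_string_dictionary(document_term):
    """Build a double entry String Number-Phrase Dictionary.

    document_term: phrase -> {document number: width}
                   (assumed layout of the document-term matrix)
    Returns (by_number, by_phrase):
      by_number: string number -> (phrase, {document number: width})
      by_phrase: phrase -> string number
    """
    by_number = {}
    by_phrase = {}
    # Assumption: String Numbers are assigned in alphabetical phrase order.
    for number, phrase in enumerate(sorted(document_term), start=1):
        by_number[number] = (phrase, dict(document_term[phrase]))
        by_phrase[phrase] = number
    return by_number, by_phrase
```

The two mappings give the two entry points of the dictionary: lookup by String Number and lookup by phrase.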


Each  genus  in  the  term-term  matrices  is  assigned  a  unique  number  and 
File  A  is  generated.  File  A  consists  of  the  String  Number  of  a  phrase 
and  the  Genus  Number  of  the  genus  in  which  it  occurs. 


[Record layout: STRING NUMBER | GENUS NUMBER]

Figure 2. File A


File A and the String Number-Phrase Dictionary are merged to
generate File B. The String Number-Phrase Dictionary is saved for further
use. File A is discarded.


[Record layout: SOURCE DERIVED PHRASE | GENUS NUMBER | STRING NUMBER]

Figure 3. File B


From File B is generated File C, which consists of, for each
non-trivial word in each Source Derived Phrase in File B, that word
followed by its phrase’s Genus Number and String Number. File B is
discarded.
Figure 4. File C


File C is sorted alphabetically on the words, and file entries with
identical words are combined to produce the Word Profile File as in
Figure 5. File C is discarded.
Figure 5. Word Profile File
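
The chain from File A through File C to the Word Profile File can be sketched as below. This is a minimal in-memory illustration rather than the tape-oriented programs proposed in the report; the function name, the record layouts, and in particular the list of trivial words are assumptions made for the example.

```python
from collections import defaultdict

# Assumed stop list; the report does not enumerate its trivial words.
TRIVIAL_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

def build_word_profile(phrase_of, genus_of):
    """Derive the Word Profile File from File A and the phrase dictionary.

    phrase_of: string number -> source derived phrase
    genus_of:  string number -> genus number (File A)
    Returns: word -> list of (genus number, string number) records.
    """
    # File B: phrase, genus number, string number.
    file_b = [(phrase_of[s], genus_of[s], s) for s in sorted(genus_of)]
    # File C: one record per non-trivial word of each phrase.
    file_c = [
        (word, genus, s)
        for phrase, genus, s in file_b
        for word in phrase.lower().split()
        if word not in TRIVIAL_WORDS
    ]
    # Sort on the words and combine records with identical words.
    profile = defaultdict(list)
    for word, genus, s in sorted(file_c):
        profile[word].append((genus, s))
    return dict(profile)
```

A word that occurs in phrases of more than one genus carries several genus numbers in its profile, which is what later allows the system to detect a request that spans genera.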


The input partitioned term-term matrices are rewritten using the
String Number-Phrase Dictionary to produce the String Association
Matrix File. The input matrices are now discarded.
Figure 6. String Association Matrix File


This  file  is  ordered  by  Genus  Number  and  String  Number. 
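
A possible rewriting of the term-term matrices into the String Association Matrix File is sketched below. The representation of each genus as a mapping of phrases to associated phrases with a numeric strength is an assumption of this example, as are the names used.

```python
def build_string_association_file(genus_matrices, number_of):
    """Rewrite partitioned term-term matrices in terms of String Numbers.

    genus_matrices: genus number -> {phrase: {associated phrase: strength}}
                    (assumed layout; "strength" stands for the matrix entry)
    number_of:      phrase -> string number
    Returns records (genus number, string number, associates), where
    associates is a sorted list of (string number, strength) pairs.
    """
    records = []
    for genus, matrix in genus_matrices.items():
        for phrase, row in matrix.items():
            associates = sorted(
                (number_of[other], strength) for other, strength in row.items()
            )
            records.append((genus, number_of[phrase], associates))
    # The file is ordered by Genus Number and then by String Number.
    records.sort(key=lambda record: (record[0], record[1]))
    return records
```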

A subsidiary file needed for document identification is a Document
Number-Bibliographic Data Dictionary which contains the title, author,
source, and other identifying data for each document. (See Figure 7.)

Figure 7 contains the complete set of files used for retrieval
purposes.

C. Outline of Retrieval Operations

The following retrieval operations are designed to be as
user-oriented as possible. It is assumed that it is desirable in a
user-oriented system not to force the user to express his initial
request in a very restricted vocabulary and form. The user therefore
submits his initial request in the form of a phrase describing what he
wishes to find a document about. He may use any vocabulary he wishes.
The system heuristically develops his initial request through interface
with the user to obtain source derived phrases which the user indicates
as elaborations and refinements of his initial request. The user can be
led by the system to browse in the general area of his request and to
broaden or narrow his request as he wishes. The user may consult
documents related to his request and, by accepting and rejecting them,
modify his description of what he is looking for.

Interface between the system and user is in natural language and
the user is not required to know about the technicalities of the
system. It is felt that these factors are highly desirable.

In the actual operation of the system there are many options
possible for user-system interaction in the determination of the flow of
the retrieval processes. Experimental evaluation of the several retrieval
operation flow patterns must be made to determine the optimal patterns
in terms of the user’s satisfaction with retrieved documents and with
the ease of interacting with the system. In the following outline of
proposed retrieval operation flow such options will be noted.

When the user has entered his request in the form of a description,
the request is scanned for words which occur in the Word Profile File.
It is an open question whether a syntactic analysis of the request will
furnish any information which would improve the system’s understanding
of the request. For example, will restrictive clauses in the request
usefully narrow the system’s search? Will word order be important? Or
will later user-system interaction resolve such problems more easily?

The profile records of words from the Word Profile File found in
the request description are compared to determine whether they all fall
within one genus. If the request words are homogeneous, then the
operations described below are performed. If the words found in the
request are found in different genera, the user is presented with a list
of phrases from the different genera in which his words appear. The user
is asked to select the phrase most pertinent to his request. If he is
unsure or cannot decide, the phrases are presented again but with their
most highly associated phrases from the String Association Matrix File.
If the user is still undecided or wishes to explore the conceptual area
of one choice, associated phrases of any phrase are presented until a
choice of one is made or until a group of phrases from one genus is
selected. The user may return to this point from further on in the
operation if he feels that his original choices were incorrect or if he
wishes to explore alternatives. It is believed that this reiterative
process will operationally be of great value, allowing the user to
browse from one genus to another as well as within a genus and enabling
the user to reformulate his original request in the terminology and
phraseology of the system.
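
The scan of the request and the test for homogeneity can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and return values are hypothetical, the request is split on blanks with simple punctuation stripping, and the request is treated as homogeneous only when every profile record of every matched word carries the same genus number.

```python
def scan_request(request, word_profile):
    """Find request words present in the Word Profile File.

    word_profile: word -> list of (genus number, string number) records
    Returns (matched, genera): the matched words with their records and
    the set of genus numbers in which they occur.
    """
    matched = {}
    for word in request.lower().split():
        word = word.strip(".,;:?!\"'()")
        if word in word_profile:
            matched[word] = word_profile[word]
    genera = {genus for records in matched.values() for genus, _ in records}
    return matched, genera

def next_step(request, word_profile, phrase_of):
    """Decide how the dialogue continues after the scan (hypothetical)."""
    matched, genera = scan_request(request, word_profile)
    if not matched:
        return ("no_match", [])
    strings = sorted({s for records in matched.values() for _, s in records})
    if len(genera) == 1:
        # Homogeneous request: continue with these String Numbers.
        return ("proceed", strings)
    # Words fall in different genera: offer the phrases for selection.
    return ("choose", [phrase_of[s] for s in strings])
```

Under the "choose" outcome the user would select among the listed phrases, and the scan could be repeated on the selected phrase until a single genus remains.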

Evaluation of this browsing operation will concern how far from the
original request it is practical to proceed and how meaningful and
useful it would be to allow the user to rank presented phrases as
pertinent to his wants instead of either accepting or rejecting them. Of
overall interest will be comparisons of what the user originally thought
he wanted and what the system leads him to formulate as his request,
particularly when his reformulated request yields documents with which
he is satisfied. It is believed that this browsing operation will
transform whatever formalized "subject heading" nature initial requests
have to either the point of view of the human indexer or the textual
nature of the source-derived phrases, with a minimum of user discomfort
and a maximum of informative user interface with the system.

It  should  be  noted  that  the  words  in  the  phrases  presented  to  the 
user  are  given  contextual  meaning  by  being  embedded  in  phrases.  This 
fact  should  insure  that  the  user  is  presented  by  the  system  with 
meaningful  expressions  and  meaningful  choices. 

The user arrives at the next stage of the retrieval operation
having made a choice of a phrase or of phrases which he believes are the
topic or topics he wishes to have a document about. The system now
presents him with phrases which are directly associated with his
reformulated request, and the user has the option of further refining his
request by adding phrases to it and/or browsing a bit.
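
The presentation of directly associated phrases might look like the sketch below. The function name, the `limit` parameter, and the use of a numeric association strength to order the phrases are assumptions of this example; the report states only that associated phrases are presented.

```python
def associated_phrases(chosen, association, phrase_of, limit=5):
    """List phrases directly associated with the chosen String Numbers.

    chosen:      set of String Numbers in the reformulated request
    association: string number -> {associated string number: strength}
                 (assumed in-memory form of the String Association
                 Matrix File)
    phrase_of:   string number -> phrase
    """
    scores = {}
    for s in chosen:
        for other, strength in association.get(s, {}).items():
            if other not in chosen:
                # Keep the strongest link from any chosen phrase.
                scores[other] = max(scores.get(other, 0), strength)
    ranked = sorted(scores, key=lambda o: (-scores[o], o))
    return [phrase_of[o] for o in ranked[:limit]]
```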


When the user has finally formulated his request, the system, using
the String Number-Phrase Dictionary containing document numbers and
widths, ranks pertinent document numbers on the basis of maximal width
values of the set of request phrases. The user is given the ranking of
the documents and the necessary bibliographic data.
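
One reading of this ranking step is sketched below: each document is scored by the largest width that any request phrase has in it, and documents are listed in descending order of that score. This interpretation of "maximal width values" is an assumption (summing the widths would be an equally plausible alternative), and the function name and dictionary layout are hypothetical.

```python
def rank_documents(request_strings, string_dictionary):
    """Rank documents by the maximal width of the request phrases.

    request_strings:   String Numbers of the request phrases
    string_dictionary: string number -> (phrase, {document number: width})
    Returns a list of (document number, score), best first; ties are
    broken by document number.
    """
    best = {}
    for s in request_strings:
        _, postings = string_dictionary[s]
        for doc, width in postings.items():
            if doc not in best or width > best[doc]:
                best[doc] = width
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))
```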

If the user wishes, he may ask for a display of selected passages
from these documents and determine whether he is satisfied. If he is
not, he may re-enter the retrieval operation, indicate those documents
which are satisfactory and those which are not, and browse in that
portion of the genus that contains the satisfactory documents. The
system carries out these operations by retaining the phrases which are
associated with the satisfactory documents, inhibiting phrases
associated with the rejected documents, and returning to the browsing
portion of the retrieval operation.
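
The retaining and inhibiting of phrases can be sketched as follows. The function name and layout are hypothetical, and the report does not say how a phrase attached to both a satisfactory and a rejected document should be treated; this sketch assumes such a phrase is retained.

```python
def feedback_phrases(accepted, rejected, string_dictionary):
    """Split phrases into retained and inhibited sets after user feedback.

    accepted, rejected: document numbers judged satisfactory / not
    string_dictionary:  string number -> (phrase, {document number: width})
    Returns (retained, inhibited) as sets of String Numbers.
    """
    accepted, rejected = set(accepted), set(rejected)
    retained, inhibited = set(), set()
    for s, (_, postings) in string_dictionary.items():
        docs = set(postings)
        if docs & accepted:
            retained.add(s)
        if docs & rejected:
            inhibited.add(s)
    # Assumption: a phrase tied to any satisfactory document stays retained.
    return retained, inhibited - retained
```

The retained set would then seed the browsing operation, while the inhibited set would be withheld from the phrases offered to the user.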

D.  Flowcharts 

The following flowcharts summarize the generation of files used
for retrieval and the flow of retrieval operations. I am indebted
to Andrew Xasarda who drafted them for me.